Centrifuge: Integrated Lease Management and Partitioning
for Cloud Services


Atul Adya†, John Dunagan*, Alec Wolman*
†Google, *Microsoft Research


Abstract: Making cloud services responsive is critical 
to providing a compelling user experience. Many large- 
scale sites, including LinkedIn, Digg and Facebook, ad- 
dress this need by deploying pools of servers that oper- 
ate purely on in-memory state. Unfortunately, current 
technologies for partitioning requests across these in- 
memory server pools, such as network load balancers, 
lead to a frustrating programming model where requests 
for the same state may arrive at different servers. Leases 
are a well-known technique that can provide a better pro- 
gramming model by assigning each piece of state to a 
single server. However, in-memory server pools host an 
extremely large number of items, and granting a lease 
per item requires fine-grained leasing that is not sup- 
ported in prior datacenter lease managers. 

This paper presents Centrifuge, a datacenter lease 
manager that solves this problem by integrating parti- 
tioning and lease management. Centrifuge consists of a 
set of libraries linked in by the in-memory servers and 
a replicated state machine that assigns responsibility for 
data items (including leases) to these servers. Centrifuge 
has been implemented and deployed in production as 
part of Microsoft’s Live Mesh, a large-scale commercial 
cloud service in continuous operation since April 2008. 
When cloud services within Mesh were built using Cen- 
trifuge, they required fewer lines of code and did not need 
to introduce their own subtle protocols for distributed 
consistency. As cloud services become ever more com- 
plicated, this kind of reduction in complexity is an in- 
creasingly urgent need. 


1 Introduction 


Responsiveness is critical to delivering compelling
cloud services. Many large-scale sites, including
LinkedIn, Digg and Facebook, address the simultaneous
needs of scale and low latency by partitioning their
user data across pools of servers that operate purely on
in-memory state [36, 34, 35, 21, 23]. Processing most
operations directly out of memory yields low-latency
responses. These sites achieve reliability by using some
separate service (such as a replicated database) to reload
the data into the server pool in the event of a failure.


Unfortunately, current technologies for building in- 
memory server pools lead to a frustrating programming 
model. Many sites use load balancers to distribute re-
quests across such server pools, but load balancers force
the programmer to handle difficult corner cases: requests 
for the same state may arrive at different servers, lead- 
ing to multiple potentially inconsistent versions. For ex- 
ample, in a cloud-based video conferencing service, the 
data items being partitioned might be metadata for in- 
dividual video conferences, such as the address of that 
conference’s rendezvous server. Inconsistencies can lead 
to users selecting different rendezvous points, and thus 
being unable to connect even when they are both online. 
The need to deal with these inconsistencies drastically in- 
creases the burden on service programmers. In our video 
conferencing example, the developer could reduce the 
chance that two nodes would fail to rendezvous by im- 
plementing quorum reads and writes on the in-memory 
servers. In other cases, programmers are faced with 
supporting application-specific reconciliation, a problem 
that is known to be difficult [39, 33, 17]. 


This paper describes Centrifuge, a system for building 
in-memory server pools that eliminates most of the dis- 
tributed systems complexity, allowing service program- 
mers to focus on the logic particular to their service. Cen- 
trifuge does this by implementing both lease manage- 
ment and partitioning using a replicated state machine. 
Leases are a well-known technique for ensuring that only 
one server at a time is responsible for a given piece of 
state [3]. Partitioning refers to assigning each piece of 
state to an in-memory server; requests are then sent to the 
appropriate server. To support partitioning, the replicated 
state machine implements a membership service and dy- 
namic load management. Partitioning for in-memory 
servers additionally requires a mechanism to deal with 
state loss. Centrifuge addresses this need with explicit 
API support for recovery: it notifies the service indi- 
cating which state has been lost and needs to be recov- 
ered, e.g., because a machine crashed and lost its lease. 
Centrifuge does not recover the state itself so that
applications can choose their own recovery strategy,
e.g., recovering from a variety of datacenter storage
systems or even relying on clients to re-publish state into the
system. Relying on client republishing is the approach
taken by the Live Mesh services [22], and we describe 
this in more detail in Section 4. This combination of 
functionality allows Centrifuge to replace most datacen- 
ter load balancers and simultaneously provide a simpler 
programming model. 

Providing both lease management and partitioning 
is valuable to the application developer, but it leads 
to a scalability challenge in implementing Centrifuge. 
Each in-memory server may hold hundreds of millions 
of items, and there may be hundreds of such servers. 
Naively supporting fine-grained leases (one for each 
item) allows flexible load management, but it could re- 
quire a large number of servers dedicated solely to lease 
traffic. In contrast, integrating leasing and partitioning 
allows Centrifuge to provide the benefits of fine-grained 
leases without their associated scalability costs. 

In particular, integrating leasing and partitioning al- 
lows Centrifuge to incorporate the following techniques: 
leases on variable-length ranges, manager-directed leas- 
ing, and conveying the partitioning assignment through 
leases. Variable-length ranges specify contiguous por- 
tions of a flat namespace that are assigned as a sin- 
gle lease. Internally, Centrifuge’s partitioning algorithm 
uses consistent hashing to determine the variable-length 
ranges. Manager-directed leasing avoids the problem of 
lease fragmentation. It allows the manager to change the 
length of the ranges being leased so that load can be shed 
at fine granularities, while simultaneously keeping the 
number of leases small. This is in contrast to the tra- 
ditional model where clients request leases from a man- 
ager, which can potentially degenerate into requiring one 
lease for each item. Manager-directed leasing also leads 
to changes in the leasing API: instead of clients request- 
ing individual leases, they simply ask which leases they 
have been assigned. Finally, because the lease and par- 
titioning assignments are both being performed by the 
manager, there is no need for separate protocols for these 
two tasks: the lease protocol implicitly conveys the re- 
sults of the partitioning algorithm. 

Centrifuge has been implemented and deployed in 
production as part of Microsoft’s Live Mesh, a large- 
scale commercial cloud service in continuous operation 
since April 2008 [22]. As of March 2009, it is in active 
use by five Live Mesh component services spanning hun- 
dreds of servers. As we describe in Section 4, Centrifuge 
successfully hid most of the distributed systems com- 
plexity from the developers of these component services: 
the services were built with fewer lines of code and with- 
out needing to introduce their own subtle protocols for 
distributed consistency or application-specific reconcili- 
ation. As cloud services become ever more complicated, 
reducing this kind of complexity is an increasingly
urgent need.
To summarize, this paper’s main contributions are: 


• we demonstrate that integrating leasing and partitioning
  can provide the benefits of fine-grained leases to
  in-memory server pools without their associated
  scalability costs;

• we show that real-world cloud services written by
  other developers are simplified by using Centrifuge;
  and

• we provide performance results from Centrifuge in
  production as well as a testbed evaluation.


The remainder of this paper is organized as follows: 
In Section 2, we describe the design and implementation 
of Centrifuge. In Section 3, we explain the Centrifuge 
API through an example application. In Section 4, we 
describe how three real-world cloud services were sim- 
plified using Centrifuge. In Section 5, we report on the 
behavior of Centrifuge in production and we evaluate 
Centrifuge on a testbed. In Section 6, we describe re- 
lated work. In Section 7, we conclude. 


2 Design and Implementation 


The design of Centrifuge is motivated by the needs of 
in-memory server pools. Centrifuge is designed to sup- 
port these servers executing arbitrary application logic on 
any particular in-memory data item they hold, and Cen- 
trifuge helps route requests to the server assigned a lease 
for any given piece of data, enabling the computation and 
data to be co-located. The nature of the in-memory data 
covered by the lease is service-specific, e.g., it could be 
the rendezvous information for the currently connected 
participants in a video-conferencing session, or a queue 
of messages waiting for a user who is currently offline. 
Furthermore, Centrifuge does not store this data on be- 
half of the application running on the in-memory server 
(i.e., Centrifuge is not a distributed cache). Instead, the
application manages the relationship between the Cen- 
trifuge lease and its own in-memory data. Centrifuge has 
no knowledge of the application’s data; it only knows 
about the lease. 

A common design pattern in industry for a datacenter 
in-memory server pool is to spread hundreds of millions 
of objects across hundreds of servers [36, 34, 35, 21, 23]. 
Additionally, these applications are designed so that even 
the most heavily loaded object requires much less than 
one machine’s worth of processing power. As a result, 
each object can be fully handled by one machine holding 
an exclusive lease, thereby eliminating the usefulness of 
read-only leases. Because of this, Centrifuge only grants 
exclusive leases, simplifying its API and internal design 
without compromising its usefulness in this application
domain.

Figure 1: Servers using Centrifuge link in libraries that
talk to a Centrifuge Manager service.

Centrifuge’s architecture is shown in Figure 1. 
Servers that want to send requests link in a Lookup li- 
brary, while servers that want to receive leases and pro- 
cess requests link in an Owner library. In our video con- 
ferencing example, web server frontends would link in 
the Lookup library and forward requests for a particular 
conference’s rendezvous information to the appropriate 
in-memory server linking in the Owner library. Both the 
libraries communicate with a logically centralized Man- 
ager service that is implemented using a replicated state 
machine. 

At a high level, the job of the Manager service is 
to partition a flat namespace of “keys” among all the 
servers linking in Owner libraries. The Manager service 
does this by mapping the key space into variable-length 
ranges using consistent hashing [6] with 64 virtual nodes 
per Owner library. The Manager then conveys to each 
Owner library its subset of the map (i.e., its partition-
ing assignment) using a lease protocol. We refer to this
technique as manager-directed leasing: the Centrifuge 
manager controls how the key space is partitioned and 
assigns leases directly on the variable-length ranges as- 
sociated with these partitions. As a result, the manager 
avoids the scalability problems traditionally associated 
with fine-grained leasing. 
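
As an illustration only, the following Python sketch shows one way consistent hashing with virtual nodes can carve a flat key space into variable-length ranges. The 64-bit key space, the SHA-256 hash, and the names `point`, `build_ranges`, and `owner_of` are assumptions made for this sketch, not details of the Centrifuge implementation; only the figure of 64 virtual nodes per Owner comes from the text.

```python
import hashlib
from bisect import bisect_right

KEY_SPACE = 2 ** 64      # assumed size of the flat key namespace
VIRTUAL_NODES = 64       # virtual nodes per Owner, as stated in the text


def point(label: str) -> int:
    """Hash a label to a position in the key space (hash choice is an assumption)."""
    digest = hashlib.sha256(label.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_ranges(owners):
    """Return sorted, contiguous (start, end, owner) half-open ranges."""
    points = sorted((point(f"{o}#{v}"), o)
                    for o in owners for v in range(VIRTUAL_NODES))
    ranges = []
    for i, (start, owner) in enumerate(points):
        end = points[i + 1][0] if i + 1 < len(points) else KEY_SPACE
        ranges.append((start, end, owner))
    # Keys below the first point wrap around to the last virtual node.
    if points[0][0] > 0:
        ranges.insert(0, (0, points[0][0], points[-1][1]))
    return ranges


def owner_of(ranges, key: int) -> str:
    """Find the Owner whose range contains the key."""
    starts = [r[0] for r in ranges]
    return ranges[bisect_right(starts, key) - 1][2]
```

In this sketch, adding an Owner only inserts new points on the ring, so a key either stays with its previous Owner or moves to the new one. This is the property that lets the Manager recall only the leases destined for a joining Owner.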

When a new Owner library contacts the Manager ser- 
vice, the Manager service recalls the needed leases from 
other Owner libraries and grants them to the new Owner 
library. Centrifuge also reassigns leases for adaptive load 
management (described in more detail in Section 2.4). 
Finally, Lookup libraries contact the Manager service to 
learn the entire map, enabling them to route a request to 
any Owner. 

We briefly explain the usage of Centrifuge by walking
through its use in Live Mesh’s Publish-subscribe service,
described in more detail in Section 4. Servers that wish
to publish events to topics link in the Lookup library;
they look up the server where a given topic is hosted using
the hash of the topic name as the lookup key. The
servers hosting these topics link in the Owner library;
they receive leases on the topics based on the hash of the
topic name. When a server has an event to publish, it
makes a call to its Lookup library, gets the address of the
appropriate server hosting the topic, and then sends the
publish message to this server. When this server receives
the message, it checks with its Owner library that it holds
the lease on this particular topic, and then forwards the
event to all subscribed parties.

Figure 2: Datacenter applications are often divided into
multiple component services, and servicing client requests
frequently requires communicating with multiple
such services. Centrifuge is designed to replace only the
internal network load balancers used by the component
services.

Centrifuge is designed for services that route requests 
which both originate and terminate within the datacen- 
ter. This is depicted in Figure 2. Datacenter applications 
often include many such internal services: for example, 
LinkedIn reports having divided their datacenter appli- 
cation into a client-facing frontend and multiple internal 
services, such as news, profiles, groups and communica- 
tions [36]. In Section 4, we describe how the Live Mesh 
application similarly contains multiple internal services 
that use Centrifuge. If requests originate outside the dat- 
acenter (e.g., from web browsers), using Centrifuge re- 
quires an additional routing step: requests first traverse 
a traditional network load balancer to arrive at frontends 
(e.g., web servers) that link in the Lookup library, and 
they are then forwarded to in-memory servers that link 
in the Owner library. 


2.1 Manager Service 


To describe the Manager service, we first present the 
high availability design. We then present the logic for 
lease management, partitioning, and adaptive load man- 
agement. 
Figure 3: In the Manager service, one set of servers runs
a Paxos group that provides a state store and a leader
election protocol, and another set of servers acts as either
leader or standby. Only the current leader executes the
logic for partitioning, lease management, and communication
with Lookups and Owners.


2.1.1 Leader Election and High Availability 


The Manager service’s high availability design is de- 
picted in Figure 3. At a high level, the Manager ser- 
vice consists of two sets of servers: one set of servers 
provides a Paxos group, and the other set of servers act 
as either leader or standby. In detail, the Paxos group 
is used to elect a current leader from the set of standby 
servers, and to provide a highly-available store used by 
the leader and standbys. The current leader executes 
the logic around granting leases to Owner libraries, par- 
titioning, and the protocols used to communicate with 
the Owner and Lookup libraries (or simply, the Owners 
and Lookups). Every time the leader receives a request 
that requires it to update its internal state, it commits the 
state change to the Paxos group before responding to the 
caller. To deal with the case that the leader becomes un- 
responsive, all the standby servers periodically ask the 
Paxos group to become the leader; if a new standby be- 
comes the leader, it reads in all the state from the Paxos 
group, and then resumes processing where the previous 
leader left off. 

In this high availability design, Paxos is only used to 
implement a leader election protocol and a highly avail- 
able state store. Most of the complicated program logic 
runs in the leader and can be non-deterministic. Thus, 
this split minimizes the well-known difficulties of writ- 
ing deterministic code within a Paxos group [37, 27]. 
A similar division of responsibility was also used in the 
Chubby datacenter lease manager [3, 4]. 

At a logical level, all the Owners and Lookups can 
simply send all requests to every leader and standby; they 
will only ever get a response from the leader. For effi- 
ciency, the Owners and Lookups only send requests to 
the server they believe to be the leader unless that server 
becomes unresponsive. If the leader has become un-
responsive, the Owners and Lookups start broadcasting
their messages to all the leader and standby servers until
one replies, and they then switch back to sending their 
requests to the one leader. Owners and Lookups learn 
of the Manager nodes through an external configuration 
file which can be updated when new Manager nodes are 
placed into service. 

The configuration that we use in deployment is three 
standby servers and five servers running Paxos. This al- 
lows the service to continue operating in the event of 
any two machine failures: the Paxos group requires three 
of its five servers to be operational in order to form a 
majority, while any one standby can become the leader 
and take over communication with all the Owners and 
Lookups. For simplicity, we will hereafter refer to the 
current leader in the Manager service as just the Man- 
ager. 


2.1.2 Partitioning, Leasing and Load Management 


To implement partitioning and lease management, the 
Manager maintains a set of namespaces, one per pool of 
in-memory servers it is managing. Each namespace con- 
tains a table of all the consistent hashing ranges currently 
leased to each Owner, and every leased range is associ- 
ated with a lease generation number. When a new Owner 
contacts the Manager, the Manager computes the new 
desired assignment of ranges to Owners, recalls leases 
on the ranges that are now destined for the new Owner, 
and grants new 60 second leases on these ranges to the 
new Owner as they become available (we show in Sec- 
tion 5.1.1 that 60 second leases are a good fit for our 
deployment environment). The removal of an Owner is 
similar. To support an incremental protocol for convey- 
ing changes to the assignment of leases, the Manager 
also maintains a change log for the lease table. This 
change log is periodically truncated to remove all entries 
older than 5 minutes. We describe the communication 
protocols between the Manager and the Lookup library 
and between the Manager and the Owner library in Sec- 
tions 2.2 and 2.3 respectively. 

The Centrifuge implementation also includes two fea- 
tures that are not yet present in the version running in 
production: state migration and adaptive load manage- 
ment. State migration refers to appropriately notifying 
nodes when a lease is transferred so that they can mi- 
grate the state along with the lease. Though load man- 
agement is found in some network load balancers, we are 
not aware of any that support leasing or state migration. 
To support adaptive load management, Owners report 
their incoming request rate as their load. The adaptive 
load management algorithm uses these load measure- 
ments to add or subtract virtual nodes from any Owner 
that is more than 10% above or below the mean load 
while maintaining a constant number of virtual nodes
overall. For example, if one Owner is more than 10% 
above the mean load, a virtual node is subtracted from it 
and added to the least-loaded Owner, even if that least- 
loaded Owner is not 10% below the mean load. The par- 
ticular load management algorithm is pluggable, allow- 
ing other policies to be implemented if Centrifuge re- 
quires them in the future. 
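
The example policy above can be sketched in a few lines of Python. This is a hypothetical rendering, not the production algorithm: the function name `rebalance`, the dictionary-based inputs, and the guard that keeps at least one virtual node per Owner are assumptions, and the sketch handles only the overloaded case described in the example.

```python
def rebalance(load, vnodes, threshold=0.10):
    """Move one virtual node from each overloaded Owner to the least-loaded Owner.

    load:   dict mapping Owner -> reported incoming request rate
    vnodes: dict mapping Owner -> current virtual node count (mutated in place)
    Returns the list of (from_owner, to_owner) moves that were made.
    """
    mean = sum(load.values()) / len(load)
    moves = []
    # Visit the most heavily loaded Owners first.
    for owner in sorted(load, key=load.get, reverse=True):
        overloaded = load[owner] > mean * (1 + threshold)
        if not overloaded or vnodes[owner] <= 1:
            continue
        # The target is the least-loaded Owner, even if it is not
        # more than 10% below the mean.
        target = min(load, key=load.get)
        vnodes[owner] -= 1
        vnodes[target] += 1
        moves.append((owner, target))
    return moves
```

Because every move subtracts one virtual node and adds one, the total number of virtual nodes stays constant, as the text requires.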


2.2 Lookup Library 


Each Lookup maintains a complete (though poten- 
tially stale) copy of the lease table: for every range, it 
knows both its lease generation number and the Owner 
node holding the lease. Due to the use of consistent 
hashing, this only requires about 200KB in the current
Centrifuge deployment: 100 owners × 64 virtual nodes
× 32B per range. This is a tiny amount for the servers
linking in the Lookup libraries, and the small size is 
one reason the Lookup library caches the complete table 
rather than trying to only cache names that are frequently 
looked up. 
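
The size estimate can be checked with a short calculation. The helper name `lease_table_bytes` is ours; the three inputs are the figures quoted in the text.

```python
def lease_table_bytes(owners: int, virtual_nodes: int, bytes_per_range: int) -> int:
    """Size of a full lease table: one fixed-size entry per virtual node."""
    return owners * virtual_nodes * bytes_per_range


size = lease_table_bytes(owners=100, virtual_nodes=64, bytes_per_range=32)
print(size, "bytes =", size // 1024, "KB")  # 204800 bytes = 200 KB
```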

Lookups use the lease table for two purposes. First, 
when the server linking in the Lookup library asks where 
to send a request on a given piece of state, the Lookup 
library reads the (potentially stale) answer out of its lo- 
cal copy of the lease table. Second, when the lease gen- 
eration number on a range changes, the Lookup library 
signals a loss notification unless there is a flag set stat-
ing that the state was cleanly migrated to another Owner.
At a high level, loss notifications allow servers linking 
in the Lookup library to republish data back in to the in- 
memory server pool; Sections 3 and 4 describe the use of 
loss notifications in more detail. 
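
A minimal sketch of these two uses follows, assuming half-open integer ranges and a callback for loss notifications. The class `LookupTable`, its method names, and the rule used to match new ranges against old ones (any overlapping old range with a different lease generation number counts as a change) are assumptions for illustration, not the library's actual data structures.

```python
from bisect import bisect_right


class LookupTable:
    """Local, possibly stale copy of the lease table held by a Lookup."""

    def __init__(self, entries, on_loss):
        # entries: (start, end, owner, lgn) tuples over half-open ranges
        self.entries = sorted(entries)
        self.on_loss = on_loss  # called with a list of (start, end) ranges

    def lookup(self, key):
        """Return the Owner recorded for the key, or None if unleased."""
        starts = [e[0] for e in self.entries]
        i = bisect_right(starts, key) - 1
        if i >= 0 and key < self.entries[i][1]:
            return self.entries[i][2]
        return None

    def apply(self, new_entries, cleanly_migrated=()):
        """Install a new table and signal loss where the LGN changed."""
        lost = []
        for start, end, _owner, lgn in new_entries:
            for o_start, o_end, _o_owner, o_lgn in self.entries:
                overlaps = o_start < end and start < o_end
                if overlaps and o_lgn != lgn and (start, end) not in cleanly_migrated:
                    lost.append((start, end))
                    break
        self.entries = sorted(new_entries)
        if lost:
            self.on_loss(lost)
```

Ranges listed in `cleanly_migrated` stand in for the flag mentioned above: their generation number changed, but no loss notification is raised.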


2.2.1 Lookup-Manager Protocol 


To learn of incremental changes to the Manager’s 
lease table, each Lookup contacts the Manager once ev- 
ery 30 seconds. An example of this is depicted in Fig- 
ure 4. In this example, the Manager has just recorded 
a change, noted as LSN (log sequence number) 3, into 
its change log. This change split the range [1-9] be- 
tween the Owners B and C, and the lease generation 
numbers (LGNs for short) have been modified as well. 
The Lookup contacts the Manager with LSN 2, indicat- 
ing that it does not know of this change, and the Manager 
sends the change over. The Lookup then applies these 
changes to its copy of the lease table. If the Lookup sends 
over a sufficiently old LSN, and the Manager has trun- 
cated its log of lease table changes such that it no longer 
remembers this old LSN, the Manager replies with a 
snapshot of the current lease table. The Manager also
sends over a snapshot of the entire lease table whenever
it is more efficient than sending over the complete change
list (in practice, we only observe this behavior when the
system is being brought online and many Owners are
rapidly joining). Because the total size of the lease table
is small, this limits the amount of additional data that
the Manager needs to send to Lookups when the system
experiences rapid changes in Owner membership.

[Figure 4 diagram. Lookup's lease table (current LSN 2):
[0-1: Owner=A, LGN=14], [1-9: Owner=B, LGN=15].
Manager's lease table (current LSN 3):
[0-1: Owner=A, LGN=14], [1-2: Owner=B, LGN=16],
[2-9: Owner=C, LGN=17].
Change log entry LSN 2→3, sent to the Lookup:
[1-9: Owner=B, LGN=15] → [1-2: Owner=B, LGN=16],
[2-9: Owner=C, LGN=17].]

Figure 4: Example of the protocol between the Lookup
and the Manager.
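
The Manager's side of this exchange might look like the following sketch. The class `ManagerLog` and its reply format are hypothetical. For brevity the log is truncated by LSN rather than by age, and the sketch omits the case where the Manager sends a snapshot because it is smaller than the change list.

```python
class ManagerLog:
    """Lease table plus a change log keyed by log sequence number (LSN)."""

    def __init__(self, table, lsn=0):
        self.table = dict(table)  # (start, end) -> (owner, lgn)
        self.lsn = lsn
        self.log = []             # entries: (lsn, removed_ranges, added_entries)

    def record(self, removed, added):
        """Apply a change to the table and append it to the change log."""
        self.lsn += 1
        for rng in removed:
            del self.table[rng]
        self.table.update(added)
        self.log.append((self.lsn, list(removed), dict(added)))

    def truncate(self, keep_from_lsn):
        """Forget log entries older than keep_from_lsn."""
        self.log = [c for c in self.log if c[0] >= keep_from_lsn]

    def sync(self, lookup_lsn):
        """Answer a Lookup that reports the last LSN it has applied."""
        if lookup_lsn == self.lsn:
            return ("delta", self.lsn, [])
        oldest = self.log[0][0] if self.log else self.lsn + 1
        if lookup_lsn + 1 < oldest:
            # The log no longer reaches back far enough: send everything.
            return ("snapshot", self.lsn, dict(self.table))
        return ("delta", self.lsn, [c for c in self.log if c[0] > lookup_lsn])
```

With the Figure 4 values, a Lookup at LSN 2 receives a one-entry delta for LSN 3, while a Lookup whose LSN predates the truncated log receives a snapshot.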


2.3 Owner Library 


Each Owner only knows about the ranges that are cur- 
rently leased to it. Owners send a message requesting 
and renewing leases every 15 seconds to the Manager. 
Because Manager leases are for 60 seconds, 3 consec- 
utive lease requests have to be lost before a lease will 
spuriously expire. A lease request signals Owner live- 
ness and specifies the leases where the Owner wants re- 
newals. The Manager sends back a response contain- 
ing all the ranges it is renewing and all the new ranges 
it has decided to grant to the Owner. Grants are distin- 
guished from renewals so that if an Owner restarts, it will 
not accept an extension on a lease it previously owned. 
For example, if a just restarted Owner receives a renewal 
on a lease “X’’, it refuses the renewal, and the Manager 
learns that the lease is free. This causes the Manager to 
issue a new grant on the range, thus triggering a change 
in the lease generation number. This change in lease gen- 
eration number ensures that the Manager’s log of lease 
changes reflects any Owner crashes, thus guaranteeing 
that Lookups will appropriately trigger loss notifications. 
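
The grant-versus-renewal rule can be illustrated with a small sketch. The classes `Owner` and `Manager` below are simplified stand-ins with hypothetical method names. Re-granting the range to the same Owner after a refusal is an assumption of the sketch; the text only says that a new grant is issued.

```python
class Owner:
    """Owner-side view: the ranges this process instance currently holds."""

    def __init__(self):
        self.held = set()

    def handle_response(self, renewals, grants):
        """Apply a Manager response; return the renewals this Owner refuses."""
        # A renewal is only valid for a lease this instance already holds.
        refused = [r for r in renewals if r not in self.held]
        self.held = {r for r in renewals if r in self.held} | set(grants)
        return refused


class Manager:
    """Manager-side view: lease holder and generation number per range."""

    def __init__(self):
        self.leases = {}   # range -> (owner_id, lgn)
        self.next_lgn = 1

    def grant(self, rng, owner_id):
        self.leases[rng] = (owner_id, self.next_lgn)
        self.next_lgn += 1

    def handle_refusal(self, rng, owner_id):
        # The refused lease is free; a fresh grant bumps the generation
        # number, which is what later triggers loss notifications.
        del self.leases[rng]
        self.grant(rng, owner_id)
```

A freshly restarted `Owner` has an empty `held` set, so it refuses a renewal on "X", and the resulting re-grant carries a higher lease generation number.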

Every message from the Manager contains the com- 
plete set of ranges where the Owner should now hold a 
lease. Although we considered an incremental protocol 
that sent only changes, we found that sending the com- 
plete set of ranges made the development and debugging 
of the lease protocol significantly easier. For example, 
we did not have to reconstruct a long series of message 
exchanges from the Manager log file to piece together 
how the Owner or Manager had gotten into a bad state. 
Instead, because each message had the complete set of
leased ranges, we could simply look at the previous message
and the current message to see if the implementation
was generating the correct messages. Furthermore, because
of the use of consistent hashing, all messages were
still quite small: there are 64 leased ranges per Owner
(one per virtual node) and each range is represented using
32B, which adds up to only 2KB per lease message.

[Figure 5 diagram: the Owner sends “Request Leases” to
the Manager; the Manager replies “Lease on [0-1] granted
for 60 seconds.” The lease is unavailable to other nodes
(unless recalled) for 65 seconds from when the Manager
granted the lease.]

Figure 5: How the lease protocol between the Owner and
the Manager guarantees the safety property that at most
one Owner holds the lease at any given point in time as-
suming only clock rate synchronization.


2.3.1 Dealing with Clocks 


Even after including the complete set of ranges in ev- 
ery lease message, there were still two subtle issues in 
the lease protocol. The first subtlety is guaranteeing the 
lease safety property: each key is owned by at most one 
Owner at any given point in time. For Centrifuge, we 
assume clock rate synchronization, but not clock syn- 
chronization. In particular, we assume that the Man- 
ager’s clock advances by no more than 65 seconds in the 
time it takes the Owner’s clock to advance by 60 sec- 
onds. This assumption allows Centrifuge to use the tech- 
nique depicted in Figure 5, and previously described by 
Liskov [20]. The Owner is guaranteed to believe it holds 
the lease for a subset of the time that the Manager makes 
the lease unavailable to others because: (1) the Owner 
starts holding the lease only after receiving a message 
from the Manager, and (2) the Owner’s 60-second timer 
starts before the Manager’s 65-second timer, and 60 sec- 
onds on the Owner’s clock is assumed to take less time 
than 65 seconds on the Manager’s clock. 
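
The timing argument can be made concrete with a sketch in which each side reads only its own clock. The class names and the injected `clock` parameter are assumptions. Starting the Owner's timer when the request is sent is our reading of condition (2) and Figure 5, not a detail confirmed by the text.

```python
import time

LEASE_SECONDS = 60   # Owner-side lease duration, from the text
GUARD_SECONDS = 65   # Manager-side unavailability window, from the text


class OwnerLease:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.request_sent_at = None
        self.expires_at = None

    def send_request(self):
        # Start timing before the request leaves, so this timer begins
        # earlier than any timer the Manager starts in response.
        self.request_sent_at = self.clock()

    def on_grant(self):
        # The lease is held only after the Manager's message arrives.
        self.expires_at = self.request_sent_at + LEASE_SECONDS

    def holds_lease(self):
        return self.expires_at is not None and self.clock() < self.expires_at


class ManagerLease:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.unavailable_until = None

    def grant(self):
        self.unavailable_until = self.clock() + GUARD_SECONDS

    def can_grant_to_other(self):
        return (self.unavailable_until is None
                or self.clock() >= self.unavailable_until)
```

Neither class compares its clock with the other's, so the two clocks may differ by any offset. Safety rests only on the assumed bound on their relative rates.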


2.3.2 Dealing with Message Races 


The second subtlety is dealing with message races.
We explain the benefits of not having to reason about
message races using an example involving lease recalls.
Lease recalls improve the ability of the Manager to
quickly make leases available to new Owners when they
join the system: new leases can be handed out after a
single message round trip instead of waiting up to 60
seconds for the earlier leases to expire. However, lease
recalls introduce the problem of lease recall acknowledgments
and lease grants passing in mid-flight. This problem
is depicted in Figure 6: when an Owner receives a
lease recall request, it drops the lease, and then sends a
message acknowledging the lease recall to the Manager.
If the Manager has since changed its mind and sent out
a new lease grant, the Manager needs some way to know
that it is not safe to act on the lease recall acknowledgment
for the earlier lease grant.

[Figure 6 diagram: the Owner sends “I don’t have the
lease.” while the Manager sends “Here is the lease.” The
Owner then concludes “I have the lease again!” while the
Manager concludes “Since you returned the lease, I’ll give
it to someone else.”]

Figure 6: Without care, message races can lead to violat-
ing the lease safety property. Centrifuge prevents this by
including sequence numbers in the lease messages be-
tween the Manager and the Owner, and using the se-
quence numbers to filter out such message races.

[Figure 7 diagram: each message carries a pair
<Owner SN, Manager SN>. Starting from Owner SN 10 and
Manager SN 101, the message <11,101> is accepted. The
messages <12,101> and <11,102> then cross in flight and
both are dropped. Each side waits to resend, and the
message <13,102> gets through.]

Figure 7: The Manager and Owner use sequence num-
bers to filter out message races. When a message race
occurs, they wait for a random backoff, and then resend.

We solved this problem by implementing a protocol 
that hides message races from the program logic at the 
Manager and Owners. This protocol adds two sequence 
numbers to all lease messages, as depicted in Figure 7. 
The sender of a message includes both its own sequence 
number and the most recent sequence number it heard 
from the remote node. When the Manager receives a 
message from the Owner that does not contain the Man- 
ager’s most recent sequence number, the Manager knows 
that the Owner sent this incoming message before the 
Owner processed the previous message from the Manager,
and the Manager drops the racing message from the
Owner. The Owner does likewise. This prevents the kind
of message race depicted in Figure 6. After either party
drops a racing message, it waits for a random backoff,
and then resends its message. Forward progress resumes
when one node’s message is received on the other side
before its counterpart initiates its resend, as depicted in
Figure 7.

After the Manager receives the new message from the 
Owner, the Manager’s own state most likely changes. 
The Manager may send a new message to the Owner, 
but the earlier racing message from the Manager to the 
Owner is permanently discarded. This is because there is 
no guarantee that the previous message to the Owner is 
still valid, e.g., the Manager may no longer want to grant 
a lease to the Owner. Finally, the protocol also includes 
a session nonce (not shown). This prevents an Owner 
from sending a message, crashing and re-establishing a 
connection with new sequence numbers, and then having 
the previously sent message be received and interpreted 
out of context. 
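
The sequence-number filter can be sketched as follows. The class `Endpoint` is a hypothetical stand-in for either the Manager or an Owner, and messages are modeled as (sender SN, echoed peer SN) tuples. Recording the peer's number even when a message is dropped follows our reading of Figure 7. The random backoff and the session nonce are omitted.

```python
class Endpoint:
    """One side of the lease protocol (either the Manager or an Owner)."""

    def __init__(self, my_sn, peer_sn):
        self.my_sn = my_sn      # sequence number of the last message we sent
        self.peer_sn = peer_sn  # most recent sequence number heard from peer

    def send(self):
        """Build the header for a new message: (our SN, last peer SN heard)."""
        self.my_sn += 1
        return (self.my_sn, self.peer_sn)

    def receive(self, msg):
        """Return True to process the message, False to drop it as a race."""
        sender_sn, echoed_sn = msg
        # Always remember the peer's latest number so a resend can echo it.
        self.peer_sn = max(self.peer_sn, sender_sn)
        # If the sender had not yet seen our latest message, its message
        # raced with ours and must not reach the program logic.
        return echoed_sn == self.my_sn
```

Replaying Figure 7 with this sketch: the Owner's first message is accepted, the two crossing messages are both dropped, and the Owner's resend (which now echoes the Manager's latest sequence number) gets through.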


2.4 Scalability 


Centrifuge is designed to work within a datacenter 
management paradigm where the incremental unit of ca- 
pacity is a cluster of a thousand or fewer machines. The 
current services using Centrifuge are all part of the Live 
Mesh application, which does follow this paradigm; it 
can be deployed into some number of clusters, and each 
individual cluster is presumed to have good internal net- 
work connectivity (e.g., no intra-cluster communication 
crosses a WAN connection). The use of clusters deter- 
mines our scalability goals for Centrifuge — it must be 
able to support all the machines within a single clus- 
ter. Also, this level of scalability is sufficient to meet 
Centrifuge’s goal of replacing internal network load bal- 
ancers, which in this management paradigm are never 
shared across clusters. As we show in Section 5, Cen- 
trifuge scales well beyond this point, and thus we did not 
investigate further optimizing our implementation. 


2.5 Loss Notification Rationale 


One alternative to our design for loss notifications is 
to replicate the in-memory state across multiple Owner 
nodes. The primary benefit of this alternative design is 
that applications could obtain higher availability during 
node failures because there would be no need to wait for 
clients to republish data lost when nodes fail or reboot. 
However, the downsides would be significant. First, because
replication cannot handle widespread or correlated
failures, there is no benefit in terms of simplifying applications:
the loss notification mechanism is still needed.
Second, there is the additional cost of the RAM needed
to hold multiple copies of the state at different nodes.
Given the infrequency of node failures and reboots in the
datacenter, it seems unlikely that the benefits of replication
outweigh the significant costs of holding multiple
copies in RAM. Finally, the implementation would
be significantly more complex, both in the Manager
logic around leasing and partitioning and in the
Owner logic around performing the operations. This
would add the requirement that the Owner actions become
deterministic because Centrifuge has no notion of
what actions the application is performing at the Owner
nodes.


// Lookup API
URL Lookup(Key key)
void LossNotificationUpcall(KeyRange[] lost)

// Owner API
bool CheckLeaseNow(Key key, out LeaseNum leaseNum)
bool CheckLeaseContinuous(Key key, LeaseNum leaseNum)
void OwnershipChangeUpcall(KeyRange[] grant,
                           KeyRange[] revoke)

Figure 8: The Centrifuge API divided into its Lookup and
Owner parts. We omit asynchronous versions of the calls
and calls related to dynamic load balancing and state
migration. Upcalls are given as arguments to the relevant
constructors.


3 API 


A simplified version of the Centrifuge API is shown 
in Figure 8. The API is divided into the calls exported by 
the Lookup library and the calls exported by the Owner 
library. We explain this API using the Publish-subscribe 
service, shown in Figure 9, as our running example. 


3.1 Lookup 


The servers that wish to make use of the Publish-
subscribe service must link in the Lookup library. When
the server in the example wishes to send a message
subscribing its URL to a particular topic (in the example,
the message is “Subscribe(1, http://A)”), the server calls
Lookup(“1”) and gets back the URL for the Publish-
subscribe server responsible for this subscription list.

The semantics of Lookup() are that it returns hints. If 
it returns a stale address (e.g., the address of a Publish- 
subscribe server that is no longer responsible for this sub- 
scription list), the staleness will be caught (and the re- 
quest rejected) at the Publish-subscribe server using the 
Owner API. Because the Manager rapidly propagates up- 
dated versions of the lease map to the Lookup library, 
callers should retry after a short backoff on such rejected 
requests. When the system is quiesced (i.e., no servers 
are joining or leaving the system), all calls to Lookup() 
return the correct address.


[Figure 9 diagram: Servers A through N, each linking in the Lookup library, send requests such as “Subscribe(1, http://A)” to Publish-subscribe servers that link in the Owner library and are responsible for key ranges such as [1-2] and [4-5].]

Figure 9: We explain the Centrifuge API using the 
Publish-subscribe service’s Subscribe() operation. 
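
To make the hint semantics of Lookup() concrete, the following Python sketch shows one way a caller might retry a rejected request. It is an illustration only, not the Centrifuge implementation: send_with_retry, StaleOwnerError, the lookup and send callables, and the randomized exponential backoff are all assumptions introduced here (the text above only says that callers should retry after a short backoff).

```python
import random
import time


class StaleOwnerError(Exception):
    """Hypothetical error: the contacted server does not hold the lease."""


def send_with_retry(lookup, send, key, payload, max_attempts=5,
                    base_backoff_s=0.01, sleep=time.sleep):
    """Treat lookup(key) as a hint; on rejection, back off and look up again."""
    for attempt in range(max_attempts):
        url = lookup(key)               # hint: may be a stale address
        try:
            return send(url, payload)   # the owner rejects if it lacks the lease
        except StaleOwnerError:
            # Assumed policy: randomized, growing backoff so that an updated
            # lease map has time to reach the Lookup library.
            sleep(base_backoff_s * (2 ** attempt) * random.random())
    raise StaleOwnerError(f"no owner accepted key {key!r}")
```

A caller would pass its own Lookup() wrapper and transport function; the sleep parameter exists only so that the backoff can be replaced in tests.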


If a node crashes, the servers linking in the Lookup 
library may wish to learn of this crash so they can proac- 
tively republish the data that was lost. In the example of 
Figure 9, Server A can respond to a loss notification by 
re-sending its earlier subscribe message. As mentioned 
in Section 2.2, the Manager enables this by assigning 
new lease generation numbers to all the ranges held by 
the crashed node (even if they are assigned back to the 
crashed node after it has recovered). All the Lookup 
libraries learn of the lease generation number changes 
from the Manager, and they then signal a LossNotifica- 
tionUpcall() on the appropriate ranges. 
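
The generation-number comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: the lease map is modeled as a dictionary from a key range to an (owner URL, generation number) pair, range boundaries are assumed not to change between updates, and the class and method names are hypothetical rather than taken from the Lookup library.

```python
def changed_ranges(old_map, new_map):
    """Return the key ranges whose lease generation number changed.

    Both maps are dicts: (low, high) -> (owner_url, generation).
    Only ranges present in both maps are compared (assumed fixed boundaries).
    """
    lost = []
    for key_range, (_, old_gen) in old_map.items():
        new_entry = new_map.get(key_range)
        if new_entry is not None and new_entry[1] != old_gen:
            lost.append(key_range)
    return sorted(lost)


class LookupLibrary:
    """Hypothetical stand-in that signals a loss notification upcall."""

    def __init__(self, loss_notification_upcall):
        self._upcall = loss_notification_upcall
        self._lease_map = {}

    def on_lease_map_update(self, new_map):
        lost = changed_ranges(self._lease_map, new_map)
        self._lease_map = dict(new_map)
        if lost:
            self._upcall(lost)
```

Note that the comparison is on the generation number rather than the owner address, so a range that is assigned back to the same node after it recovers still triggers the upcall.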


3.2 Owner


The Owner part of the API allows a Publish-subscribe 
server to perform an operation guarded by a lease. Be- 
cause the Manager may not have assigned any particular 
lease to this Owner, the Owner must be prepared to fail 
this operation if it does not have the lease; there is no API 
call that forces the Manager to give the lease for a given 
key to the Owner. 

The code for the Subscribe operation at the Publish- 
subscribe servers is shown in Figure 10. This code uses 
leases to guarantee that requests for a given subscription 
list are only being served by a single node at a time, and 
that this node is not operating on a stale subscription list 
(i.e., a subscription list from some earlier time that the 
node owned the lease, but where the node has not held 
the lease continuously). 

The general pattern for using the Owner API is shown
at the top of Figure 10. When a request arrives at
a Publish-subscribe server, step (1) is to call
CheckLeaseNow() to check whether it should serve the request.
If this call succeeds, step (2) is to validate any previously
stored state by comparing the current lease number with
the lease number from when the state was last modified.
If the Publish-subscribe server has held the lease
continuously, this old lease number will equal the Owner
library’s current lease number, and the check will succeed.
If the Publish-subscribe server has not held the lease
continuously, the old subscription list may be stale (i.e., it
may not reflect all operations executed on the list), and
therefore the old subscription list should be discarded.


// General pattern:
// (1) Check that request has arrived at correct node
// (2) Check that existing state is not stale;
//     discard state that turns out to be stale
// (3) Perform arbitrary operation on this state;
//     store lease number with any created state
// (4) Check that lease has been continuously held;
//     if so, return result

bool Subscribe(key, address) {
    // (1) Check that this is the correct node
    ok = CheckLeaseNow(key, out currentLeaseNum);
    if (!ok) return false;
    // (2) Check that existing state is not stale;
    //     discard state that turns out to be stale
    storedLeaseNum = this.leases[key];
    if (currentLeaseNum != storedLeaseNum)
        this.subscriptionLists[key] = EmptyList();
        // in this app, okay to reset to EmptyList()
        // if prior state was stale
    // (3) Perform arbitrary operation on this state;
    //     store lease number with any created state
    //     in this case, simply add subscription and
    //     store lease number
    this.subscriptionLists[key].Add(address);
    this.leases[key] = currentLeaseNum;
    // (4) Check that lease has been continuously held;
    //     if so, return result
    if (!CheckLeaseContinuous(key, currentLeaseNum))
        return false;
    return true;
}

Figure 10: How the Subscribe() operation uses the
Owner API.



Step (3) is to perform an arbitrary operation on this 
state, and then to store any state modifications along with 
the lease number. In the case of Subscribe(), the opera- 
tion is adding the address to the list of subscriptions. The 
newly stored lease number will be checked in future calls 
to Subscribe(). 

Step (4) is to return a result to the caller, or more gen- 
erally, to send a result to some other node. If the lease 
was lost while the operation was in progress, the caller 
can simply return false and does not need to proactively 
clean up the state that is now invalid. Future calls to
Subscribe() will either fail at CheckLeaseNow() or will clear
the invalid state when they find the stored lease number 
to be less than the current lease number. 
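
The four steps can also be exercised end to end with a small runnable sketch. The Python below is an illustration, not the production code: FakeOwnerLibrary is a hypothetical in-process stand-in for the Owner library in which a test sets lease numbers directly, and the method names are Python renderings of the calls in Figure 8.

```python
class FakeOwnerLibrary:
    """Hypothetical stand-in for the Owner library: key -> lease number."""

    def __init__(self):
        self.leases = {}

    def check_lease_now(self, key):
        """Return (held, lease_num), mirroring CheckLeaseNow's out parameter."""
        lease_num = self.leases.get(key)
        return lease_num is not None, lease_num

    def check_lease_continuous(self, key, lease_num):
        """True only if the lease is still held under the same lease number."""
        return self.leases.get(key) == lease_num


class PubSubServer:
    def __init__(self, owner):
        self.owner = owner
        self.subscription_lists = {}
        self.stored_lease = {}

    def subscribe(self, key, address):
        held, current = self.owner.check_lease_now(key)          # step (1)
        if not held:
            return False
        if self.stored_lease.get(key) != current:                # step (2)
            self.subscription_lists[key] = []                    # discard stale state
        self.subscription_lists[key].append(address)             # step (3)
        self.stored_lease[key] = current
        return self.owner.check_lease_continuous(key, current)   # step (4)


owner = FakeOwnerLibrary()
server = PubSubServer(owner)
owner.leases["1"] = 10
first = server.subscribe("1", "http://A")    # lease held: accepted
del owner.leases["1"]
second = server.subscribe("1", "http://B")   # lease lost: rejected at step (1)
owner.leases["1"] = 11
third = server.subscribe("1", "http://B")    # new lease number: old list discarded
```

After the lease is regranted under a new lease number, the subscription for http://A is gone, which is the behavior the text describes: stale state is cleared lazily by the next call rather than proactively when the lease is lost.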

Note that throughout this sequence of steps, there is 
no point where the Publish-subscribe server requests a 
lease on a given item. Instead, the Publish-subscribe 
server simply checks what leases it has been assigned. 
As mentioned in Section 1, this is a departure from stan- 
dard lease manager APIs, and the novel Centrifuge API 
is critical to allowing Centrifuge to provide the benefits 
of fine-grained leasing, while only granting leases on a 
small number of ranges. 

This example has focused on the use of leases within 
a single service, but lease numbers can also be used in
communication between services. For example, a server
linking in the Owner library can include a lease number 
in a request to another service. The other service can 
then guard against stale messages by only processing a 
request if the included lease number is greater than or 
equal to any previously seen lease number. This tech- 
nique has also been described in previous work on lease 
managers; for example, it is one of the patterns for us- 
ing Chubby’s lease numbers, which are called Chubby 
sequencers [3]. 
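
As an illustration of this cross-service check, the receiving service only needs to remember the highest lease number it has seen for each key. The sketch below is hypothetical (LeaseNumberGuard is not part of Centrifuge or Chubby) and assumes that lease numbers are comparable and that newer leases carry larger numbers.

```python
class LeaseNumberGuard:
    """Reject requests carrying a lease number older than one already seen."""

    def __init__(self):
        self._highest_seen = {}  # key -> highest lease number seen so far

    def accept(self, key, lease_num):
        highest = self._highest_seen.get(key)
        if highest is not None and lease_num < highest:
            return False         # stale sender: an older lease number
        self._highest_seen[key] = lease_num
        return True
```

A request stamped with the same or a newer lease number is processed; a delayed request from a previous lease holder is dropped.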

The last part of the Owner API is the Owner- 
shipChangeUpcall(). This upcall may be used to initial- 
ize data structures when some new range of the key space 
has been granted, or to garbage collect state associated 
with a range of the key space that has been revoked. Be- 
cause of thread scheduling and other effects, this upcall 
may be delivered some short time after the lease change 
occurs. 
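
A handler for this upcall might look like the following sketch. It is illustrative only: RangeState, integer keys, and inclusive (low, high) ranges are assumptions made here. Because the upcall may arrive some time after the lease change, the sketch uses it only for housekeeping and leaves correctness to the lease checks described earlier.

```python
class RangeState:
    """Hypothetical per-server state keyed by integer keys."""

    def __init__(self):
        self.granted = set()  # set of (low, high) ranges currently granted
        self.items = {}       # key -> application state

    def ownership_change_upcall(self, grant, revoke):
        for key_range in grant:
            self.granted.add(key_range)   # initialize bookkeeping for new ranges
        for low, high in revoke:
            self.granted.discard((low, high))
            # Garbage collect state for keys in the revoked range.
            for key in [k for k in self.items if low <= k <= high]:
                del self.items[key]
```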


4 How Centrifuge Supports the Live Mesh 
Services 


Centrifuge is used by five component services that are 
part of the Live Mesh application [22]. All these com- 
ponent services were built by other developers. The Live 
Mesh application provides a number of features, includ- 
ing file sharing and synchronization across devices, noti- 
fications of activity on these shared files, a virtual desk- 
top that is hosted in the datacenter and allows manipulat- 
ing these files through a web browser, and connectivity 
to remote devices (including NAT traversal). In the re- 
mainder of this Section, we describe how three of the 
Live Mesh services use Centrifuge to enable a particular 
scenario. We then explain how these services were sim- 
plified by leveraging the lease semantics of Centrifuge. 
Finally, we discuss some common characteristics of the 
Live Mesh services and how these characteristics influ- 
enced the design of Centrifuge. 


4.1 Example of Three Live Mesh Services 


The particular scenario we focus on is one where a 
user has two PCs, one at home and one at work, and both 
are running the Live Mesh client software. At some point 
the user wants to connect directly from their work PC to 
their home PC so as to retrieve an important file, but the 
home PC has just been given a new IP address by the 
user’s ISP. To enable the work PC to find the home PC, 
the home PC needs to publish its new IP address into the 
datacenter, and the work PC then needs to learn of the 
change. 

Figure 11 depicts how the component services enable
this scenario. First, the home PC sends its new IP address
in a “publish new IP” request to a Frontend server.
The Frontend server calls Lookup() using the home PC’s
“DeviceID” as the key and routes this request to the
in-memory Device Connectivity server tracking the IP address
for this home PC. The Device Connectivity servers
use the Owner library with DeviceIDs as the keys, and
they store the IP address for a device under each key.

[Figure 11 diagram labels: home PC “publish new IP”; work PC “dequeue messages”; Frontend Web Server; Device Connectivity; Publish-subscribe.]

Figure 11: How three Centrifuge-based services within
the Live Mesh application cooperate to enable a work
PC to connect to a home PC that has just acquired a new
IP address.

After updating the IP address, the Device Connectiv- 
ity server sends a message with this new IP address to the 
in-memory Publish-subscribe server tracking subscrip- 
tions to the home PC’s device connectivity status; the De- 
vice Connectivity server uses the home PC’s DeviceID 
as the key for calling Lookup(). The Publish-subscribe 
servers also use the Owner library with DeviceIDs as the 
keys, but under each key, they store a subscription list 
of all devices that want to receive connectivity status up- 
dates from the home PC. Finally, the Publish-subscribe 
server sends out a message containing the device con- 
nectivity update to each subscribed client device. 
Because some of the subscribed client devices (such 
as the user’s work PC) may be offline, each message is 
routed to an in-memory Queue server. If a subscribed 
client device is online, it maintains a persistent connec- 
tion to the appropriate Queue server, and the messages 
are immediately pushed over the persistent connection. 
If a subscribed client device is offline and later comes 
online (the case depicted for the work PC in Figure 11), 
it sends a “dequeue messages” request. This request 
arrives at a Frontend and is routed to the appropriate 
Queue server, which responds with the message contain- 
ing the new IP address. The Publish-subscribe service is 
the source for all the messages sent through the Queue 
servers; all these messages describe changes for datacen- 
ter objects that the client had previously subscribed to, 
such as the device connectivity status in this example. 
The Queue servers also use the Owner library with
DeviceIDs as the keys, and they store a message queue
under each key. The Queue servers also link in the
Lookup library so that they can receive a
LossNotificationUpcall() if any Device Connectivity or Publish-
subscribe server crashes. If it receives a loss notification,
the Queue server puts the notification into the queue for 
any client device that had state stored on the server that 
crashed, allowing the client device to quickly learn that 
it needs to re-publish its IP address, poll to re-read other 
client devices’ IP addresses, and/or re-subscribe to learn 
of future IP address changes. 

Because these servers are all co-located within the 
same datacenter, we assume that there are no persis- 
tent intransitive connectivity failures between these ma- 
chines, and therefore if the Device Connectivity, Publish- 
subscribe and Queue servers can all talk to the Centrifuge 
Manager and renew their leases, they will also be able to 
eventually send messages to each other. If an initial mes- 
sage send fails (perhaps because the Lookup library had 
slightly stale information), the server should simply retry 
after a short period of time. 

Because of the in-memory nature of these server 
pools, a crash always results in the crashed server los- 
ing its copy of the data. All the services currently us- 
ing Centrifuge rely on Centrifuge’s loss notifications for 
two purposes: to trigger the client to re-create any lost 
data within the in-memory server pool and to poll all 
datacenter objects where a relevant change message may 
have been lost. The client already knows the small set 
of datacenter objects it needs to poll because it had pre- 
viously subscribed to them using the Publish-subscribe 
service, and the Queue service is only used to deliver 
change messages from these subscriptions and loss no- 
tifications. For these services, relying on the client for 
state re-creation is sufficient because the state is only 
useful if the client is online. To pick one example, if 
a client is offline, its last published IP address is irrelevant.
Furthermore, the reliance on clients to recreate the
state allows the service to forgo the expense of storing re- 
dundant copies of this state on disk within the datacenter 
(although Centrifuge itself is also compatible with recov- 
ery from datacenter storage). Lastly, although the power 
of leases to simplify distributed storage systems is well- 
established [40, 13, 25, 3], the next several sections elab- 
orate on how leases can also simplify services that are not 
tightly integrated with storage (the Live Mesh services). 


4.2 Simplifications to Device Connectivity 
Service 


The main simplification from using leases in the De- 
vice Connectivity service is that all updates logically 
occur at a single server. This means that there are 
never multiple documents on multiple servers contain- 
ing the IP address about a single device. Because there 
are never multiple documents, there is no need to write 
application-specific logic about reconciling the documents,
and there is no need to add application-specific
metadata simply to aid in document reconciliation (e.g.,
the time at which the IP address was updated). Although 
it may be feasible to design a good reconciliation heuris- 
tic for device IP addresses, this is yet another tax on the 
developer. Additionally, the Device Connectivity service 
is also used to store other kinds of data besides IP ad- 
dresses, and reconciliation becomes more difficult as the 
data becomes more complex. The use of leases allows 
the developer to avoid writing the application-specific 
reconciliation routines for IP addresses and for all these 
other kinds of data. 


4.3 Simplifications to Queue Service 


In the Queue service, the main simplification from us- 
ing leases is that the service can provide a strong guar- 
antee to all its callers: once a message has been success- 
fully enqueued, either the client will receive it, or the 
client will know it has lost some messages and must ap- 
propriately poll. This allows the Publish-subscribe server 
to consider itself “finished” with a message once it has 
been given to the Queue service; the Publish-subscribe 
server does not have to deal with the possibility that 
the client device neither received the message nor even 
learned that it lost a message. Such silent message loss 
could be particularly frustrating; for example, the home 
PC could publish its new IP address without the work PC 
ever learning of the change, leading to a long-lived con- 
nectivity failure. The Queue service’s guarantee prevents 
this problem. 

To provide this guarantee, the Queue service relies on 
there being at most one copy of a queue at any given 
point in time. In contrast, if there were multiple copies 
of a queue, the Publish-subscribe server might believe it 
successfully delivered a message, while the client device 
only ever connected to another copy of the queue to read 
its messages. While it may be feasible to build a proto- 
col that addresses this issue in alternative ways (perhaps 
using counters or version vectors), the Centrifuge lease 
mechanism avoided the need for such an additional pro- 
tocol. 


4.4 Simplifications to Publish-subscribe Ser- 
vice 


The simplifications from using leases in the Publish- 
subscribe service are similar to the simplifications in the 
Queue service. The Publish-subscribe service uses leases 
to provide the following strong guarantee to its callers: 
once a message has been accepted by the Publish- 
subscribe service, each subscriber will either receive the 
message or know that they missed some messages. The 
details of how leases enable this guarantee in the Publish- 
subscribe service are essentially the same as those described
in Section 4.3 for the Queue service.


4.5 Live Mesh Service Characteristics


We now briefly discuss how the common character- 
istics of the Live Mesh services influenced the design 
choices we made for Centrifuge. 

All of the Live Mesh services that use Centrifuge store 
a relatively large number of small objects in memory on 
each server, and the operations performed on these ob- 
jects are relatively lightweight. As a result, we designed 
Centrifuge to rely on statistical multiplexing over these 
many small objects. This allows Centrifuge to operate on 
variable length ranges of the keyspace, rather than pro- 
viding a directory service optimized for moving individ- 
ual items. None of the current Live Mesh services need 
to support data items where the processing load on an 
individual object is near the capacity of an entire server. 
Services with such a workload would find statistical mul- 
tiplexing much less effective. To enable such workloads, 
one might consider modifications to Centrifuge such as: 
a load balancing algorithm with different dynamics; a 
different partitioning algorithm; and support for replicat- 
ing objects across multiple owners. 

Another aspect of the Centrifuge design motivated by 
the needs of Live Mesh services is the capability of re- 
covering state from clients, rather than just from storage 
servers within the datacenter. Most of the state stored at 
the Owner nodes for the Live Mesh services is cached 
data where the original copy is known at the client (e.g., 
the client’s IP address for the Device Connectivity ser- 
vice). The need to recover state from clients led to our 
design for loss notifications, where these notifications are 
delivered to the Lookup nodes, rather than building re- 
covery functionality directly into the Owner nodes. 

Finally, as we will show in Section 5, reboots and 
machine failures in the Live Mesh production environ- 
ment are infrequent. This observation, combined with 
the fact that large quantities of RAM are still reason- 
ably expensive, led to our decision to recover state from 
the clients rather than replicating state across multiple 
Owner nodes. 


5 Evaluation 


As we mentioned in the Introduction, Centrifuge has 
been deployed in production as part of Microsoft’s Live 
Mesh application since April of 2008 [22]. In Section 5.1 
we examine the behavior of Centrifuge in production 
over an interval of 2.5 months, stretching from early De- 
cember 2008 to early March 2009. During this time, 
there were approximately 130 Centrifuge Owners and 
approximately 1,000 Centrifuge Lookups. Previewing 
our results, we find that Centrifuge easily scaled to meet 
the demands of this deployment. In Section 5.2, we use 
a testbed to examine Centrifuge’s ability to scale beyond 
the current production environment.

Figure 12: CPU and network load measurements from
the leader and standbys in the Centrifuge Manager ser- 
vice deployed in production over 2.5 months, as well as 
messages dropped due to races. 


5.1 Production Environment 


Because the Centrifuge Manager is the scalability bot- 
tleneck in our system, we focus our observation on the 
behavior of this component. We first examine the steady- 
state behavior of the Manager over 2.5 months, and then 
focus on the behavior of the Manager during the hours 
surrounding the rollout of a security patch. 


5.1.1 Steady State Behavior 


Figures 12(a) and (b) show measurements from each 
leader and standby server at the granularity of an hour 
from 12/11/2008 to 3/2/2009. We observe that both the 
CPU and network utilization are very low on all these
servers, though it is slightly higher on the current leader 
at any given point in time, and there are bursts of network 
utilization when a standby takes over as the new leader, 
as on 12/16/2008 and 1/15/2009. The low steady-state 
CPU and network utilization we observe provides evi- 
dence that our implementation easily meets our current 
scalability needs. 

In both cases where the leader changed, the relevant 
administrative logs show that a security patch was rolled 
out, requiring restarts for all servers in the cluster. The 
patch rollout on 12/16/2008 at 06:30 led to the leader 
changing from Server 2 to Server 3, while the patch roll- 
out on 1/15/2009 at 21:20 led to the servers rotating the 
leader role in quick succession, with Server 3 resuming
the leader role at the end of this event. Because the second
security patch rollout led to multiple changes in the
leader, we examine the dynamics of this rollout in more 
detail in Section 5.1.2. 

Because there were no security patch rollouts between 
1/16/2009 and 3/2/2009, we examined this 1.5 month pe- 
riod to see how frequently Owners lost their leases due 
to crashing, network disconnect, or any other unplanned 
event. An Owner crash can make the portion of the 
key space assigned to that Owner unavailable until the 
Owner’s lease has expired, and we use 60-second leases 
(as mentioned in Section 2.1.2) because we expect un- 
planned Owner failures to be quite rare. We observed 
a total of 10 lease losses from the 130 Owners over the 
entire 1.5 months. This corresponds to each individual 
Owner having a mean time to unplanned lease loss (i.e.,
not due to action by the system administrator) of 19.5 
months. This validates our expectation that unplanned 
Owner failures are quite rare. Finally, the number of 
Owners returned to 130 in less than 10 minutes follow- 
ing 7 of the lease losses, and in about an hour for the 
other 3 lease losses; this shows that Owner recovery or 
replacement by the cluster management system [16] is 
reasonably rapid. 
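
The 19.5-month figure follows directly from the numbers above: 130 Owners observed for 1.5 months give 195 Owner-months, over which 10 unplanned lease losses occurred. A short Python check of this arithmetic (the function name is ours):

```python
def mean_time_to_lease_loss(owners, observation_months, losses):
    """Owner-months of observation divided by the number of lease losses."""
    return owners * observation_months / losses


# 130 Owners x 1.5 months = 195 Owner-months; 195 / 10 losses = 19.5 months.
print(mean_time_to_lease_loss(130, 1.5, 10))  # 19.5
```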

We also examine one aspect of the Manager-Owner 
lease protocol in detail over this same 2.5 month interval.
As described in Section 2.3.2, our lease protocol incorporates
a simple mechanism for preventing message
races from compromising the lease safety invariant dur- 
ing lease recalls: detecting such races and dropping the 
messages. We do not expect message races to be com- 
mon, but if they were, repeated message drops might 
lead to one of the Owners losing its lease. This moti- 
vates us to examine the number of messages dropped due 
to race detection, as shown in Figure 12(c). We first ob- 
serve that message races do occur in bursts when a new 
standby takes the leader role. However, during the 1.5 
months from 1/16/2009 to 3/2/2009 where Server 3 con- 
tinuously held the leader role, only 12 messages were 
dropped. This observation validates our expectation that 
message races are very rare in steady state. 


5.1.2 Rollout of Security Patch 


Figure 13 examines a 2.5 hour window at a 30 second
granularity where all 3 standby servers rotated through
the leader role. Servers 1 and 3 were restarted at approximately
21:20. Server 2 then took over as the leader, leading
to slightly higher CPU utilization and significantly
higher network utilization at this server. Network utilization
peaks at almost 500 KB/sec, significantly more
than the steady state of around 10 KB/sec. At approximately
21:45, Server 2 was similarly restarted, leading
to Server 1 taking over as the leader. Finally, at approximately
22:20, Server 3 took the leader role from Server
2. We do not know why this last change in leader role
occurred, as Server 2 was not restarted at this point. In
all cases, restarts were preceded by a spike in CPU utilization,
likely due to applying the patch.

Figure 13: Measurements from the evening of 1/15/2009
for the leader and standby servers in the Manager service
deployed in production, capturing CPU, network
load, and Owner restarts.

Figure 13(c) shows the number of live Owners seen by 
each leader. The number of live Owners dips at approx- 
imately the same time as the change in leader, reflecting 
how the patch is applied to one group of servers, and then 
applied to another group of servers after a modest inter- 
val. 

Figure 13(b) also shows that network utilization con- 
tinued to experience bursts well after a new standby had 
taken the leader role. From additionally examining the 
Manager logs, we find that 76% of this network traffic is 
due to the Lookup libraries, likely because after restart- 
ing they need to get the entire lease table from the leader. 

Figure 13 shows that even during a period of abnormally
high churn in Owners, Lookups and Managers,
the observed load at the leader and standbys in
the Manager service is small. This offers further evi- 
dence that the Centrifuge implementation meets its cur- 
rent scalability demands. 

Finally, we investigated one of the Owners to double
check that the stability we observe in the Manager was
also reflected in the Owner API success rate. We arbitrarily
chose the approximately 5-day time period 1/8/2009
22:00 to 1/13/2009 17:00, a period when the Manager
service did not observe any churn. During this time period,
this particular Owner experienced 0 failed calls to
CheckLeaseNow() and CheckLeaseContinuous() out of
over 53 million invocations. This is consistent with the
intuition that the stability observed at the Manager results
in calls to the Owner API always succeeding.

Role       # Servers   Instances/   Total
Manager    5           1            5

Table 1: Centrifuge testbed configuration.


5.2 Testbed 


In this subsection, we evaluate Centrifuge’s ability 
to scale beyond the current production deployment. In 
particular, we use more Lookups and Owners than are 
present in the production setting, and we evaluate the 
Manager load when these Lookups and Owners are 
restarted more rapidly than in a production patch rollout. 
Previewing our results, we find that the Manager easily 
scales to this larger number of Owners and Lookups and 
this more rapid rate of restarts. 

Our testbed consists of 40 servers, each running a 2.26 
GHz Core2 Duo processor with 4GB RAM and the Win- 
dows Server 2008 SP2 operating system. Table 1 shows 
our testbed configuration. The approximately 10:1 ratio
between Frontends (Lookups) and Device Connectivity
servers (Owners) was chosen based on the ratio deployed
in production. We made minor modifications to
the performance counter implementation on the Device 
Connectivity Servers and Frontends in order to run this 
many instances on each server. 

To examine the ability of Centrifuge to support a more
rapid patch rollout rate across this larger number of Owners
and Lookups, we conducted two separate experiments.
In the first experiment, we restarted all the Owners
over the course of 32 minutes, and in the second
experiment, we restarted all the Lookups over the same
interval. Compared to the production patch rollout of
Section 5.1.2, this is restarting approximately twice as many
nodes (2,200) in about half as much time (32 minutes versus
about 1 hour). In both experiments, we measure CPU and
network usage at the leader in the Manager service.
Figure 14 shows the results: even when the Owners were
restarting, the leader CPU averaged only light utilization,
and network usage only went up to 5 MB/sec. Based on this,
we conclude that the Centrifuge implementation supports
the current deployment by a comfortable margin.

Figure 14: The CPU and network load on the Manager
leader under rapid restarts for a large number of Owners 
and Lookups. 


6 Related Work 


Centrifuge integrates lease management and partition- 
ing into a system that makes it easier to build certain 
datacenter services. Accordingly, we divide our discus- 
sion of related work into lease managers, partitioning 
systems, and other software infrastructure for building 
datacenter services. 


6.1 Lease Managers 


The technique of using leases to provide consistency 
dates back over two decades to Gray and Cheriton’s work 
on the V file system [13]. In this section we survey the 
three leasing systems most closely related to Centrifuge: 
Frangipani’s lease manager [40] because of its approach 
to scalability; Chubby [3] because of its use to support 
datacenter applications; and Volume Leases [42, 41] be- 
cause of how it grants leases on many objects at a time. 

Frangipani implements a scalable lease management 
service by running multiple lease managers, and hav- 
ing each of these lease managers handle a different sub- 
set of the total set of leases. Centrifuge scales using a 
very different technique: exposing a novel API so that 
lease recipients receive all or none of the leases within a 
range. Centrifuge’s design allows a pool of in-memory 
servers needing a large number of leases to be supported 
by a single lease manager. However, the techniques in 
Centrifuge and Frangipani are composable: one could 
imagine applying Frangipani’s technique to further scale 
Centrifuge by creating multiple Centrifuge managers and 
having each of them handle a different subset of the lease 
namespace. 
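
As a rough illustration of range-granularity leasing, the sketch below keeps one lease per contiguous key range rather than one per item, so an owner holds either every key in a range or none of them. The names (`RangeLease`, `RangeLeaseTable`, `owner_of`), the integer key space, and the shared clock are assumptions made for illustration; they are not Centrifuge's actual API.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLease:
    """Hypothetical lease covering the half-open key range [start, end)."""
    start: int
    end: int
    owner: str
    expires_at: float  # seconds on some shared clock (an assumption)


class RangeLeaseTable:
    """Maps keys to owners through non-overlapping range leases."""

    def __init__(self, leases):
        # Sort by range start so a key can be located by binary search.
        self._leases = sorted(leases, key=lambda lease: lease.start)
        self._starts = [lease.start for lease in self._leases]

    def owner_of(self, key, now):
        """Return the owner holding a live lease on key, or None."""
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            return None
        lease = self._leases[i]
        if lease.start <= key < lease.end and now < lease.expires_at:
            return lease.owner
        return None


table = RangeLeaseTable([
    RangeLease(0, 1000, "owner-a", expires_at=60.0),
    RangeLease(1000, 2000, "owner-b", expires_at=60.0),
])
```

With this layout, the number of leases tracked grows with the number of ranges rather than the number of items, which is the property the comparison above relies on.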

Like Centrifuge, Chubby implements a lease manager 
that funnels all clients through a single machine and re- 
lies on a Paxos layer for high availability. Chubby is de- 
signed to be used primarily for leader election in a dat- 
acenter, and in practice it typically maintains around a 
thousand locks to support tens of thousands of clients [3]. 
In contrast, Centrifuge directly provides both partition- 
ing and leases on ranges from a partitioned namespace, 
enabling Centrifuge to replace internal network load bal- 
ancers. 
Supporting Centrifuge’s functionality using Chubby 
would require modifying Chubby to incorporate ele- 
ments from Centrifuge, including the partitioning func- 
tionality. Alternatively, one could imagine extracting the 
partitioning functionality out of an existing system such 
as BigTable [5], modifying it to support Centrifuge’s par- 
titioning algorithm, and then deploying this additional 
service alongside Chubby. 

Volume Leases [42, 41] is a protocol for granting 
leases on groups of objects, such as all the files and di- 
rectories in a file system volume or all the web pages 
served by a single server. Centrifuge differs from Vol- 
ume Leases in at least two major ways. First, Centrifuge 
can dynamically create new object groupings by send- 
ing out new ranges, while none of the work on Vol- 
ume Leases investigated dynamically creating volumes. 
In Centrifuge, dynamically splitting and merging ranges 
underlies the majority of the logic around partitioning 
(the policy for modifying ranges) and leasing (the mechanism
for conveying the modified ranges). Second, the
work on Volume Leases did not include high availability
for the lease manager; Volume Leases are designed to be
the cache coherency protocol for a system that might or
might not incorporate high availability, not a standalone
lease manager.


6.2 Partitioning Systems 


The three most closely related pieces of prior re- 
search in partitioning are the partitioning subsystem of 
BigTable [5], network load balancers [9, 28], and other 
software partitioning systems such as DHTs [32, 38, 31, 
30] and the LARD system [29]. We already compared 
Centrifuge to the partitioning subsystem of BigTable as 
part of our comparison to Chubby. 

The main contrast between Centrifuge and network 
load balancers is that network load balancers do not in- 
clude lease management. Attempting to add lease man- 
agement (and the requisite high availability) would con- 
stitute a major addition to these systems, possibly mir- 
roring the work done to build Centrifuge. 

Centrifuge implements partitioning using client li- 
braries, an approach previously taken by many DHTs 
and the LARD system. Compared to this prior work, 
the contribution of Centrifuge is demonstrating that com- 
bining such partitioning with lease management simpli- 
fies the development of datacenter services running on 
in-memory server pools. Though there has been a great 
deal of work on DHTs, we are not aware of any DHT- 
based system that explores this integration, most likely 
because DHTs often focus on scaling to billions of peers 
and providing leases on the DHT key space is viewed 
as incompatible with this scalability goal. In contrast, 
Centrifuge does incorporate a lease manager to provide a 
better programming model for in-memory server pools, 
and Centrifuge only aims to scale to hundreds of such 
servers. 


6.3 Other Infrastructure for Datacenter 
Services 


There is great interest within both industry and the re- 
search community in providing better infrastructure for 
datacenter services [2, 12, 15, 11]. Much of this inter- 
est has been directed at datacenter storage: BigTable [5], 
Dynamo [8], Sinfonia [1] and DDS [14] form a repre- 
sentative sample. Centrifuge is designed to support in- 
memory server pools. These server pools may want to 
leverage such a datacenter storage system to support re- 
loading data in the event of a crash, but they also ben- 
efit from partitioning and leasing within the in-memory 
server pool itself. 

Distributed caching has been widely studied, often in 
the context of remote file systems [7, 10, 19, 18, 24, 6], 
and it is commonly used today in datacenter applications 
(e.g., memcached [26]). Centrifuge differs from such 
systems in that Centrifuge is not a cache; Centrifuge 
is infrastructure that makes it easier to build other ser- 
vices that run on pools of in-memory servers. Because 
Centrifuge is a lower-level component than a distributed 
cache, it can serve a wider class of applications. For ex- 
ample, the Publish-subscribe Service described in Sec- 
tion 4 devotes significant logic to state management (e.g., 
deciding what state to expire, what state to lock, evalu- 
ating access control rules, etc.). This logic is service- 
specific, and it is not something that the service de- 
veloper would want to abdicate to a generic distributed 
cache. In contrast, the Publish-subscribe Service can 
leverage Centrifuge because Centrifuge provides only the
lower-level services of leasing and partitioning.


7 Conclusion 


Datacenter services are of enormous commercial im- 
portance. Centrifuge provides a better programming 
model for in-memory server pools, and other developers 
have validated this by using Centrifuge to build multi- 
ple component services within Microsoft’s Live Mesh, a 
large-scale commercial cloud service. As datacenter ser- 
vices continue to increase in complexity, such improve- 
ments in programmability are increasingly vital. 


Abstract: As cloud services grow to span more and more 
globally distributed datacenters, there is an increasingly 
urgent need for automated mechanisms to place applica- 
tion data across these datacenters. This placement must 
deal with business constraints such as WAN bandwidth 
costs and datacenter capacity limits, while also mini- 
mizing user-perceived latency. The task of placement is 
further complicated by the issues of shared data, data 
inter-dependencies, application changes and user mobil- 
ity. We document these challenges by analyzing month- 
long traces from Microsoft’s Live Messenger and Live 
Mesh, two large-scale commercial cloud services. 

We present Volley, a system that addresses these chal- 
lenges. Cloud services make use of Volley by submitting 
logs of datacenter requests. Volley analyzes the logs us- 
ing an iterative optimization algorithm based on data ac- 
cess patterns and client locations, and outputs migration 
recommendations back to the cloud service. 

To scale to the data volumes of cloud service logs, 
Volley is designed to work in SCOPE [5], a scalable 
MapReduce-style platform; this allows Volley to per- 
form over 400 machine-hours worth of computation in 
less than a day. We evaluate Volley on the month-long 
Live Mesh trace, and we find that, compared to a state- 
of-the-art heuristic that places data closest to the pri- 
mary IP address that accesses it, Volley simultaneously 
reduces datacenter capacity skew by over 2x, reduces 
inter-datacenter traffic by over 1.8x and reduces 75th 
percentile user-latency by over 30%. 


1 Introduction 


Cloud services continue to grow rapidly, with ever 
more functionality and ever more users around the globe. 
Because of this growth, major cloud service providers 
now use tens of geographically dispersed datacenters, 
and they continue to build more [10]. A major unmet 
challenge in leveraging these datacenters is automati- 
cally placing user data and other dynamic application 
data, so that a single cloud application can serve each 
of its users from the best datacenter for that user. 


At first glance, the problem may sound simple: de- 
termine the user’s location, and migrate user data to 
the closest datacenter. However, this simple heuris- 
tic ignores two major sources of cost to datacenter 
operators: WAN bandwidth between datacenters, and 
over-provisioning datacenter capacity to tolerate highly 
skewed datacenter utilization. In this paper, we show that 
a more sophisticated approach can both dramatically re- 
duce these costs and still further reduce user latency. The 
more sophisticated approach is motivated by the follow- 
ing trends in modern cloud services: 


Shared Data: Communication and collaboration are 
increasingly important to modern applications. This 
trend is evident in new business productivity software, 
such as Google Docs [16] and Microsoft Office On- 
line [32], as well as social networking applications such 
as Facebook [12], LinkedIn [26], and Twitter [43]. These 
applications have in common that many reads and writes 
are made to shared data, such as a user’s Facebook wall, 
and the user experience is degraded if updates to shared 
data are not quickly reflected to other clients. These 
reads and writes are made by groups of users who need 
to collaborate but who may be scattered worldwide, mak- 
ing it challenging to place and migrate the data for good 
performance. 


Data Inter-dependencies: The task of placing shared 
data is made significantly harder by inter-dependencies 
between data. For example, updating the wall for a 
Facebook user may trigger updating the data items that 
hold the RSS feeds of multiple other Facebook users. 
These connections between data items form a commu- 
nication graph that represents increasingly rich applica- 
tions. However, the connections fundamentally trans- 
form the problem’s mathematics: in addition to connec- 
tions between clients and their data, there are connec- 
tions in the communication graph in-between data items. 
This motivates algorithms that can operate on these more 
general graph structures. 

Application Changes: Cloud service providers want 
to release new versions of their applications with ever 
greater frequency [35]. These new application features 
can significantly change the patterns of data sharing and 
data inter-dependencies, as when Facebook released its 
instant messaging feature. 

Reaching Datacenter Capacity Limits: The rush in 
industry to build additional datacenters is motivated in 
part by reaching the capacity constraints of individual 
datacenters as new users are added [10]. This in turn 
requires automatic mechanisms to rapidly migrate appli- 
cation data to new datacenters to take advantage of their 
capacity. 

User Mobility: Users travel more than ever to- 
day [15]. To provide the same rapid response regard- 
less of a user’s location, cloud services should quickly 
migrate data when the migration cost is sufficiently inex- 
pensive. 

In this paper we present Volley, a system for auto- 
matic data placement across geo-distributed datacenters. 
Volley incorporates an iterative optimization algorithm 
based on weighted spherical means that handles the com- 
plexities of shared data and data inter-dependencies, and 
Volley can be re-run with sufficient speed that it handles 
application changes, reaching datacenter capacity limits 
and user mobility. Datacenter applications make use of 
Volley by submitting request logs (similar to Pinpoint [7] 
or X-Trace [14]) to a distributed storage system. These 
request logs include the client IP addresses, GUIDs iden- 
tifying the data items accessed by the client requests, and 
the structure of the request “call tree”, such as a client request
updating Facebook wall 1, which triggers requests
to data items 2 and 3 handling Facebook user RSS feeds. 

Volley continuously analyzes these request logs to 
determine how application data should be migrated be- 
tween datacenters. To scale to these data sets, Volley is 
designed to work in SCOPE [5], a system similar to Map- 
Reduce [11]. By leveraging SCOPE, Volley performs 
more than 400 machine-hours worth of computation in
less than a day. When migration is found to be worthwhile,
Volley triggers application-specific data migration
mechanisms. While prior work has studied placing static 
content across CDNs, Volley is the first research system 
to address placement of user data and other dynamic ap- 
plication data across geographically distributed datacen- 
ters. 

Datacenter service administrators make use of Volley 
by specifying three inputs. First, administrators define 
the datacenter locations and a cost and capacity model 
(e.g., the cost of bandwidth between datacenters and the 
maximum amount of data per datacenter). Second, they 
choose the desired trade-off between upfront migration 
cost and ongoing better performance, where ongoing per- 
formance includes both minimizing user-perceived la- 
tency and reducing the costs of inter-datacenter commu- 
nication. Third, they specify data replication levels and 
other constraints (e.g., three replicas in three different 
datacenters all located within Europe). This allows ad- 
ministrators to use Volley while respecting other external 
factors, such as contractual agreements and legislation. 

In the rest of this paper, we first quantify the preva- 
lence of trends such as user mobility in modern cloud 
services by analyzing month-long traces from Live Mesh 
and Live Messenger, two large-scale commercial data- 
center services. We then present the design and imple- 
mentation of the Volley system for computing data place- 
ment across geo-distributed datacenters. Next, we evalu- 
ate Volley analytically using the month-long Live Mesh 
trace, and we evaluate Volley on a live testbed consist- 
ing of 20 VMs located in 12 commercial datacenters dis- 
tributed around the world. Previewing our results, we 
find that compared to a state-of-the-art heuristic, Volley 
can reduce skew in datacenter load by over 2x, decrease 
inter-datacenter traffic by over 1.8x, and reduce 75th 
percentile latency by over 30%. Finally, we survey re- 
lated work and conclude. 


2 Analysis of Commercial Cloud-Service 
Traces 


We begin by analyzing workload traces collected by 
two large datacenter applications, Live Mesh [28] and 
Live Messenger [29]. Live Mesh provides a number of 
communication and collaboration features, such as file 
sharing and synchronization, as well as remote access to 
devices running the Live Mesh client. Live Messenger 
is an instant messaging application. In our presentation, 
we also use Facebook as a source for examples due to its 
ubiquity. 

The Live Mesh and Live Messenger traces were col- 
lected during June 2009, and they cover all users and de- 
vices that accessed these services over this entire month. 
The Live Mesh trace contains a log entry for every mod- 
ification to hard state (such as changes to a file in the 
Live Mesh synchronization service) and user-visible soft 
state (such as device connectivity information stored on 
a pool of in-memory servers [1]). The Live Messenger 
trace contains all login and logoff events, all IM con- 
versations and the participants in each conversation, and 
the total number of messages in each conversation. The 
Live Messenger trace does not specify the sender or the 
size of individual messages, and so for simplicity, we 
model each participant in an IM conversation as hav- 
ing an equal likelihood of sending each message, and 
we divide the total message bytes in this conversation 
equally among all messages. A prior measurement study 
describes many aspects of user behavior in the Live Mes- 
senger system [24]. In both traces, clients are identified 
by application-level unique identifiers. 

To estimate client location, we use a standard com- 
mercial geo-location database [34] as in prior work [36]. 

Figure 1. Simplified data inter-dependencies in Face- 
book (left) and Live Mesh (right). In Facebook, an “up- 
dated wall” request arrives at a Facebook wall data item, 
and this data item sends the request to an RSS feed data 
item, which then sends it to the other client. In Live 
Mesh, a “publish new IP” request arrives at a Device 
Connectivity data item, which forwards it to a Publish- 
subscribe data item. From there, it is sent to a Queue 
data item, which finally sends it on to the other client. 
These pieces of data may be in different datacenters, and 
if they are, communication between data items incurs ex- 
pensive inter-datacenter traffic. 

Figure 2. Two clients, their four data items and the com- 
munication between them in the simplified Facebook ex- 
ample. Data placement requires appropriately mapping 
the four data items to datacenters so as to simultaneously 
achieve low inter-datacenter traffic, low datacenter ca- 
pacity skew, and low latency. 


The database snapshot is from June 30th 2009, the very 
end of our trace period. 


We use the traces to study three of the trends moti- 
vating Volley: shared data, data inter-dependencies, and 
user mobility. The other motivating trends for Volley, 
rapid application changes and reaching datacenter capac- 
ity limits, are documented in other data sources, such as 
developers describing how they build cloud services and 
how often they have to release updates [1, 35]. To pro- 
vide some background on how data inter-dependencies 
arise in commercial cloud services, Figure 1 shows sim- 
plified examples from Facebook and Live Mesh. In the 
Facebook example, Client 1 updates its Facebook wall, 
which is then published to Client 2; in Facebook, this al- 
lows users to learn of each other’s activities. In the Live 
Mesh example, Client 1 publishes its new IP address, 

Figure 4. Distribution of clients in the Messenger trace. 


which is routed to Client 2, enabling Client 2 to con- 
nect directly to Client 1; in Live Mesh, this is referred to 
as a notification session, and it enables both efficient file 
sharing and remote device access. The caption of Figure 1
provides additional details, as do other publications [1]. 
In both cases, the client operations involve multiple data- 
center items; inter-datacenter traffic is minimized by co- 
locating these items, while the latency of this particular 
request is minimized by placing the data items as close 
as possible to the two clients. 

Figure 2 attempts to convey some intuition for why 
data sharing and inter-dependencies make data place- 
ment challenging. The figure shows the web of connec- 
tions between just two clients in the simplified Facebook 
example; these inter-connections determine whether a 
mapping of data items to datacenters achieves low inter- 
datacenter traffic, low datacenter capacity skew, and low 
latency. Actual cloud services face this problem with 
hundreds of millions of clients. Each client may ac- 
cess many data items, and these data items may need to 
communicate with each other to deliver results to clients. 
Furthermore, the clients may access the data items from 
a variety of devices at different locations. This leads to a 
large, complicated graph. 

In order to understand the potential for this kind of 
inter-connection to occur between clients that are quite 
distant, we begin by characterizing the geographic diver- 
sity of clients in the traces. 

Client Geographic Diversity: We first study the 
traces to understand the geographic diversity of these ser- 
vices’ client populations. Figures 3 and 4 show the distri- 
bution of clients in the two traces on a map of the world. 
The figures show that both traces contain a geographically
diverse set of clients, and thus these services’ performance
may significantly benefit from intelligent data
placement.

Figure 5. Sharing of data between geographically dis- 
tributed clients in the Messenger and Mesh traces. Large 
amounts of sharing occur between distant clients. 

[Figure 6 plot: x-axis "number of unique queue objects subscribing" (0 to 175); y-axis "fraction of publish-subscribe objects"]

Figure 6. Data inter-dependencies in Live Mesh be- 
tween Publish-subscribe objects and Queue objects. A 
user that updates a hard state data item, such as a docu- 
ment stored in Live Mesh, will cause an update message 
to be generated at the Publish-subscribe object for that 
document, and all Queue objects that subscribe to it will 
receive a copy of the message. Each user or device that is 
sharing that document will have a unique Queue. Many 
Publish-subscribe objects are subscribed to by a single 
Queue, but there is a long tail of popular objects that are 
subscribed to by many Queues. 

Figure 7. Mobility of clients in the Messenger and Mesh 
traces. Most clients do not travel. However, a significant 
fraction do travel quite far. 


Geographically Distant Data Sharing: We next 
study the traces to understand whether there is significant 
data sharing among distant users. For each particular 
data item, we compute its centroid (centroid on a sphere 
is computed using the weighted spherical mean methodology,
which we describe in detail in Section 3). Figure 5 shows a
CDF for the distance over which clients access data placed
according to its centroid; data that is not
shared has an access distance of 0, as does data shared by 
users whose IP addresses map to the same geographic lo- 
cation. Given the amount of collaboration across nations 
both within corporations and between them, it is perhaps 
not surprising that large amounts of sharing happen between
very distant clients. This data suggests that even
for static clients, there can be significant benefits to plac- 
ing data closest to those who use it most heavily, rather 
than just placing it close to some particular client that 
accesses the data. 


Data Inter-dependencies: We proceed to study 
the traces to understand the prevalence of data inter- 
dependencies. Our analysis focuses on Live Mesh be- 
cause data inter-dependencies in Live Messenger have 
been documented in detail in prior work [24]. Figure 6 
shows the number of Queue objects subscribing to re- 
ceive notifications from each Publish-subscribe object; 
each such subscription creates a data inter-dependency 
where the Publish-subscribe object sends messages to 
the Queue object. We see that some Publish-subscribe 
objects send out notifications to only a single Queue ob- 
ject, but there is a long tail of popular Publish-subscribe 
objects. The presence of such data inter-dependencies 
motivates the need to incorporate them in Volley. 

Client Mobility: We finally study the traces to un- 
derstand the amount of client mobility in these services’ 
client populations. Figure 7 shows a CDF characterizing 
client mobility over the month of the trace. To compute 
this CDF, we first compute the location of each client 
at each point in time that it contacted the Live Mesh 
or Live Messenger application using the previously de- 
scribed methodology, and we then compute the client’s 
centroid. Next, we compute the maximum distance be- 
tween each client and its centroid. As expected, we ob- 
serve that most clients do not move. However, a signifi- 
cant fraction do move (more in the Messenger trace than 
the Mesh trace), and these movements can be quite dra- 
matic — for comparison purposes, antipodal points on the 
earth are slightly more than 12,000 miles apart. 
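
The mobility metric described above can be sketched as follows. This is a minimal illustration under stated assumptions: the centroid here is the normalized average of the observations' 3D unit vectors, a simple stand-in for the weighted spherical mean of Section 3 rather than the paper's procedure, and `EARTH_RADIUS_MILES`, `centroid`, and `max_distance_from_centroid` are hypothetical names.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # approximate mean Earth radius


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def centroid(points):
    """Normalized mean of the points' unit vectors (assumed stand-in)."""
    vecs = [to_unit_vector(lat, lon) for lat, lon in points]
    sx, sy, sz = (sum(c) for c in zip(*vecs))
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    if norm < 1e-12:
        raise ValueError("centroid undefined: observations cancel out")
    return (sx / norm, sy / norm, sz / norm)


def distance_miles(u, v):
    """Great-circle distance between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return EARTH_RADIUS_MILES * math.acos(dot)


def max_distance_from_centroid(points):
    """Largest distance between any observed location and the centroid."""
    c = centroid(points)
    return max(distance_miles(to_unit_vector(lat, lon), c)
               for lat, lon in points)
```

A client that always connects from one location yields a distance of approximately zero, while a client observed at two points a quarter of the way around the equator yields roughly 3,100 miles under this radius.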

From these traces, we cannot characterize the reason 
for the movement. For example, it could be travel, or it 
could be that the clients are connecting in through a VPN 
to a remote office, causing their connection to the public 
Internet to suddenly emerge in a dramatically different 
location. For Volley’s goal of reducing client latency, 
there is no need to distinguish between these different 
causes; even though the client did not physically move in 
the VPN case, client latency is still minimized by moving 
data closer to the location of the client’s new connection 
to the public Internet. The long tail of client mobility 
suggests that for some fraction of clients, the ideal data 
placement changes significantly during this month. 

This data does leave open the possibility that some
fraction of the observed clients are bots that do not correspond
to an actual user (i.e., they are modified clients
driven by a program). The current analysis does fil- 
ter out the automated clients that the service itself uses 
for doing performance measurement from various loca- 
tions. Prior work has looked at identifying bots automati- 
cally [45], and Volley might benefit from leveraging such 
techniques. 


3 System Design and Implementation 


The overall flow of data in the system is shown in Fig- 
ure 8. Applications make use of Volley by logging data 
to the Cosmos [5] distributed storage system. The ad- 
ministrator must also supply some inputs, such as a cost 
and capacity model for the datacenters. The Volley sys- 
tem frequently runs new analysis jobs over these logs, 
and computes migration decisions. Application-specific 
jobs then feed these migration decisions into application- 
specific data migration mechanisms. We now describe 
these steps in greater detail. 


3.1 Logging Requests 


To utilize Volley, applications have to log information 
on the requests they process. These logs must enable 
correlating requests into “call trees” or “runtime paths” 
that capture the logical flow of control across compo- 
nents, as in Pinpoint [7] or X-Trace [14]. If the source 
or destination of a request is movable (i.e., because it is 
a data item under the control of the cloud service), we 
log a GUID identifier rather than its IP address; IP ad- 
dresses are only used for endpoints that are not movable 
by Volley, such as the location that a user request came 
from. Because Volley is responsible for placing all the 
data named by GUIDs, it already knows their current lo- 
cations in the steady state. It is sometimes possible for 
both the source and destination of a request to be referred 
to by GUIDs—this would happen, for example, in Fig- 
ure 1, where the GUIDs would refer to Client 1’s Face- 
book wall and Client 2’s Facebook RSS feed. The exact 
fields in the Volley request logs are shown in Table 1. In 
total, each record requires only 100 bytes. 

There has been substantial prior work modifying applications
to log this kind of information, and many commercial
applications (such as the Live Mesh and Live
Messenger services analyzed in Section 2) already log 
a superset of this data. For such applications, Volley can 
incorporate simple filters to extract out the relevant sub- 
set of the logs. 

For the Live Mesh and Live Messenger commercial 
cloud services, the data volumes from generating Volley 
logs are much less than the data volumes from processing 
user requests. For example, recording Volley logs for all 
the requests for Live Messenger, an IM service with hun- 
dreds of millions of users, only requires hundreds of GB 
per day, which leads to an average bandwidth demand in 
the tens of Mbps [24]. Though we cannot reveal the ex- 
act bandwidth consumption of the Live Mesh and Live 
Messenger services due to confidentiality concerns, we 
can state that tens of Mbps is a small fraction of the total 
bandwidth demands of the services themselves. Based 
on this calculation, we centralize all the logs in a single 
datacenter; this then allows Volley to run over the logs 
multiple times as part of computing a recommended set 
of migrations. 
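
The step from daily log volume to average bandwidth is simple arithmetic. The sketch below assumes an illustrative figure of 300 GB of logs per day (the text says only "hundreds of GB") and uses the 100-byte record size stated in Section 3.1; `avg_mbps` and `records_per_day` are hypothetical helper names.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RECORD_BYTES = 100  # per-record size stated in the text


def avg_mbps(bytes_per_day):
    """Average bandwidth in megabits per second for a daily byte volume."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def records_per_day(bytes_per_day):
    """Number of fixed-size log records that fit in a daily byte volume."""
    return bytes_per_day // RECORD_BYTES


daily_bytes = 300 * 10**9  # assumed: 300 GB/day, decimal gigabytes
print(f"{avg_mbps(daily_bytes):.1f} Mbps")              # prints "27.8 Mbps"
print(f"{records_per_day(daily_bytes):,} records/day")  # prints "3,000,000,000 records/day"
```

Under this assumption the average demand is about 28 Mbps, which is consistent with the "tens of Mbps" figure quoted above.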


3.2 Additional Inputs 


In addition to the request logs, Volley requires four 
inputs that change on slower time scales. Because they 
change on slower time scales, they do not noticeably con- 
tribute to the bandwidth required by Volley. These ad- 
ditional inputs are (1) the requirements on RAM, disk, 
and CPU per transaction for each type of data handled 
by Volley (e.g., a Facebook wall), (2) a capacity and 
cost model for all the datacenters, (3) a model of la- 
tency between datacenters and between datacenters and 
clients, and (4) optionally, additional constraints on data 
placement (e.g., legal constraints). Volley also requires 
the current location of every data item in order to know 
whether a computed placement keeps an item in place or 
requires migration. In the steady state, these locations are 
simply remembered from previous iterations of Volley. 

In the applications we have analyzed thus far, the ad- 
ministrator only needs to estimate the average require- 
ments on RAM, disk and CPU per data item; the ad- 
ministrator can then rely on statistical multiplexing to 
smooth out the differences between data items that con- 
sume more or fewer resources than average. Because of 
this, resource requirements can be estimated by looking 
at OS-provided performance counters and calculating the 
average resource usage for each piece of application data 
hosted on a given server. 

Request Log Record Format

  Bytes in request (8B)
  Like Source-Entity, either a GUID or an IP address (40B)
  Used to group related requests (8B)

Table 1. To use Volley, the application logs a record with these fields for every request. The meaning and size in bytes
of each field are also shown.


Migration Proposal Record Format

  Latency-Change             The average change in latency per request to this object (4B)
  Ongoing-Bandwidth-Change   The change in egress and ingress bandwidth per day (4B)
  Migration-Bandwidth        The one-time bandwidth required to migrate (4B)

Table 2. Volley constructs a set of proposed migrations described using the records above. Volley then selects the
final set of migrations according to the administrator-defined trade-off between performance and cost.


The capacity and cost models for each datacenter
specify the RAM, disk and CPU provisioned for the service
in that datacenter, the available network bandwidth
for both egress and ingress, and the charging model for
service use of network bandwidth. While energy usage
is a significant cost for datacenter owners, in our experience
this is incorporated as a fixed cost per server that
is factored in at the long timescale of server provision- 
ing. Although datacenter owners may be charged based 
on peak bandwidth usage on individual peering links, the 
unpredictability of any given service’s contribution to a 
datacenter-wide peak leads datacenter owners to charge 
services based on total bandwidth usage, as in Amazon’s 
EC2 [2]. Accordingly, Volley helps services minimize 
their total bandwidth usage. We expect the capacity and 
cost models to be stable at the timescale of migration. 
For fluid provisioning models where additional datacen- 
ter capacity can be added dynamically as needed for a 
service, Volley can be trivially modified to ignore provi- 
sioned capacity limits. 

Volley needs a latency model to make placement de- 
cisions that reduce user perceived latency. It allows dif- 
ferent static or dynamic models to be plugged in. Vol- 
ley migrates state at large timescales (measured in days) 
and hence it should use a latency model that is stable 
at that timescale. Based on the large body of work 
demonstrating the effectiveness of network coordinate 
systems, we designed Volley to treat latencies between 
IPs as distances in some n-dimensional space specified 
by the model. For the purposes of evaluation in the pa- 
per, we rely on a static latency model because it is sta- 
ble over these large timescales. This model is based on 
a linear regression of great-circle distance between geo- 
graphic coordinates; it was developed in prior work [36], 
where it was compared to measured round trip times 
across millions of clients and shown to be reasonably ac- 
curate. This latency model requires translating client IP 
addresses to geographic coordinates, and for this purpose 
we rely on the geo-location database mentioned in Section 2.
This geo-location database is updated every two
weeks. In this work, we focus on improving latency to
users and not bandwidth to users. Incorporating bandwidth
would require specifying both a desired latency-bandwidth
tradeoff and a model for bandwidth between
arbitrary points in the Internet.
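
As a rough illustration of the static latency model described above, the sketch below computes the great-circle distance between two coordinates with the haversine formula and maps it to a latency through a linear function. The function names, the slope `ms_per_mile` and the intercept `base_ms` are placeholders chosen for illustration only; they are not the regression coefficients from [36].

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def modeled_latency_ms(distance_miles, ms_per_mile=0.02, base_ms=5.0):
    """Linear latency model: base + slope * distance (placeholder coefficients)."""
    return base_ms + ms_per_mile * distance_miles
```

A model of this shape stays fixed between Volley runs, which is the stability property the text asks for; a dynamic model could be swapped in behind the same two functions.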

Constraints on data placement can come in many 
forms. They may reflect legal constraints that data be 
hosted only in a certain jurisdiction, or they may reflect 
operational considerations requiring two replicas to be 
physically located in distant datacenters. Volley models 
such replicas as two distinct data items that may have 
a large amount of inter-item communication, along with 
the constraint that they be located in different datacen- 
ters. Although the commercial cloud service operators 
we spoke with emphasized the need to accommodate 
such constraints, the commercial applications we study 
in this paper do not currently face constraints of this 
form, and so although Volley can incorporate them, we 
did not explore this in our evaluation. 


3.3 Volley Algorithm


Once the data is in Cosmos, Volley periodically an- 
alyzes it for migration opportunities. To perform the 
analysis, Volley relies on the SCOPE [5] distributed ex- 
ecution infrastructure, which at a high level resembles 
MapReduce [11] with a SQL-like query language. In our 
current implementation, Volley takes approximately 14 
hours to run through one month’s worth of log files; we 
analyze the demands Volley places on SCOPE in more 
detail in Section 4.4. 

Figure 9. Weighted spherical mean calculation. The
weighted spherical mean (wsm()) is defined recursively
as a weighted interpolation (interp()) between pairs of
points. Here, w_i is the weight assigned to x_i, and x_i (the
coordinates for node i) consists of φ_i, the latitudinal distance
in radians between node i and the North Pole, and
λ_i, the longitude in radians of node i. The new (interpolated)
node C consists of w parts node A and 1 - w
parts node B; d is the current distance in radians between
A and B; γ is the angle from the North Pole to B to
A (which stays the same as A moves); β is the angle from
B to the North Pole to A’s new location. These are used
to compute x_C, the result of the interp(). For simplicity
of presentation, we omit describing the special case for
antipodal nodes.

Volley’s SCOPE jobs are structured into three phases.
The search for a solution happens in Phase 2. Prior
work [36] has demonstrated that starting this search in
a good location improves convergence time, and hence
Phase 1 computes a reasonable initial placement of data
items based on client IP addresses. Phase 2 iteratively 
improves the placement of data items by moving them 
freely over the surface of the earth—this phase requires 
the bulk of the computational time and the algorithm 
code. Phase 3 does the needed fix up to map the data 
items to datacenters and to satisfy datacenter capacity 
constraints. The output of the jobs is a set of poten- 
tial migration actions with the format described in Ta- 
ble 2. Many adaptive systems must incorporate explicit 
elements to prevent oscillations. Volley does not incor- 
porate an explicit mechanism for oscillation damping. 
Oscillations would occur only if user behavior changed 
in response to Volley migration in such a way that Vol- 
ley needed to move that user’s state back to a previous 
location. 


Phase 1: Compute Initial Placement. We first map 
each client to a set of geographic coordinates using the 
commercial geo-location database mentioned earlier. 
This IP-to-location mapping may be updated between 
Volley jobs, but it is not updated within a single Volley 
job. We then map each data item that is directly accessed 
by a client to the weighted average of the geographic 
coordinates for the client IPs that access it. This is done 
using the weighted spherical mean calculation shown 
in Figure 9. The weights are given by the amount of 
communication between the client nodes and the data 
item whose initial location we are calculating. The 
weighted spherical mean calculation can be thought 
of as drawing an arc on the earth between two points, 
and then finding the point on the arc that interpolates 
between the two initial points in proportion to their 
weight. This operation is then repeated to average in 
additional points. The recursive definition of weighted 
spherical mean in Figure 9 is conceptually similar to 
defining the more familiar weighted mean recursively, 


e.g.,

$$\text{weighted-mean}(\{3, x_k\}, \{2, x_j\}, \{1, x_i\}) = \tfrac{3}{6}\, x_k + \tfrac{3}{6} \cdot \text{weighted-mean}(\{2, x_j\}, \{1, x_i\})$$

Compared to weighted mean, weighted spherical mean 
has the subtlety that the rule for averaging two individual 
points has to use spherical coordinates. 

Figure 10 shows an example of this calculation us- 
ing data from the Live Mesh trace: five different devices 
access a single shared object from a total of eight dif- 
ferent IP addresses; device D accesses the shared object 
far more than the other devices, and this leads to the 
weighted spherical mean (labeled “centroid” in the fig- 
ure) being placed very close to device D. 

Finally, for each data item that is never accessed di- 
rectly by clients (e.g., the Publish-subscribe data item in 
the Live Mesh example of Figure 1), we map it to the 
weighted spherical mean of the data items that commu- 
nicate with it using the positions these other items were 
already assigned. 
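
The pairwise averaging used in Phase 1 can be sketched as follows. This is a minimal illustration, not the formulas of Figure 9: it swaps in standard spherical linear interpolation on 3-D unit vectors, uses latitude and longitude in degrees rather than distance from the North Pole in radians, and the names `to_unit`, `to_latlon`, `interp` and `wsm` are chosen here for illustration. As in the text, antipodal points are not handled.

```python
import math


def to_unit(lat, lon):
    """Convert (lat, lon) in degrees to a 3-D unit vector."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))


def to_latlon(v):
    """Convert a 3-D unit vector back to (lat, lon) in degrees."""
    x, y, z = v
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return (lat, math.degrees(math.atan2(y, x)))


def interp(w, a, b):
    """Point on the great-circle arc made of w parts a and (1 - w) parts b."""
    va, vb = to_unit(*a), to_unit(*b)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(va, vb))))
    d = math.acos(dot)  # angular distance between a and b
    if d < 1e-12:       # coincident points: nothing to interpolate
        return a
    ca = math.sin(w * d) / math.sin(d)
    cb = math.sin((1 - w) * d) / math.sin(d)
    return to_latlon(tuple(ca * p + cb * q for p, q in zip(va, vb)))


def wsm(weighted_points):
    """Weighted spherical mean of [(weight, (lat, lon)), ...], built up pairwise."""
    total, mean = weighted_points[0]
    for w, x in weighted_points[1:]:
        total += w
        # Fold the next point in with its share of the weight seen so far.
        mean = interp(w / total, x, mean)
    return mean
```

For instance, `wsm([(1, (10, 110)), (2, (10, 10))])` lands at roughly (14.7, 43.1), about two-thirds of the way along the arc toward the more heavily weighted point.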

Phase 2: Iteratively Move Data to Reduce Latency. 
Volley iteratively moves data items closer to both clients 
and to the other data items that they communicate with. 
This iterative update step incorporates two earlier ideas: 
a weighted spring model as in Vivaldi [9] and spherical 
coordinates as in Htrae [36]. Spherical coordinates de- 
fine the locations of clients and data items in a way that 
is more conducive to incorporating a latency model for 
geographic locations. The latency distance between two 
nodes and the amount of communication between them 
increase the spring force that is pulling them together. 
However, unlike a network coordinate system, nodes in
Volley only experience contracting forces; the only factor
preventing them from collapsing to a single location
is the fixed nature of client locations. This yields the
update rule shown in Figure 11. In our current implementation,
we simply run a fixed number of iterations of
this update rule; we show in Section 4 that this suffices
for good convergence.

Figure 10. An example of a shared object being placed at its weighted spherical mean (labeled “centroid” in the
figure). This particular object, the locations of the clients that access it, and their access ratios are drawn from
the Live Mesh trace. Because device D is responsible for almost all of the accesses, the weighted spherical mean
placement for the object is very close to device D’s location.

Figure 11. Update rule applied to iteratively move
nodes with more communication closer together. Here, w
is a fractional weight that determines how much node A
is moved towards node B, l_AB is the amount of communication
between the two nodes, d is the distance between
nodes A and B, x_A^current and x_B^current are the current
locations of nodes A and B, x_A^new is the location of A after
the update, and k is an algorithmic constant.

Intuitively, Volley’s spring model attempts to bring 
data items closer to users and to other data items that 
they communicate with regularly. Thus it is plausible 
that Volley’s spring model will simultaneously reduce 
latency and reduce inter-datacenter traffic; we show in 
Section 4 that this is indeed the case for the commercial 
cloud services that we study. 
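
The following sketch illustrates the flavor of such a spring-style update. It rests on two assumptions that are not taken from the text: the weight is given the form w = 1 / (1 + k * d * l_AB), which is merely consistent with the description of w, d, l_AB and k in the caption of Figure 11, and positions live on a flat 2-D plane instead of the sphere so that the example stays short. The names `spring_update` and `run_phase2` are hypothetical.

```python
import math


def spring_update(pos_a, pos_b, comm_ab, k=1.0):
    """Move A toward B; more communication or distance gives a stronger pull."""
    d = math.dist(pos_a, pos_b)
    w = 1.0 / (1.0 + k * d * comm_ab)  # assumed form; w = 1 leaves A in place
    # New location is w parts A and (1 - w) parts B (planar stand-in for interp()).
    return tuple(w * a + (1.0 - w) * b for a, b in zip(pos_a, pos_b))


def run_phase2(items, clients, edges, iterations=10, k=1.0):
    """Run a fixed number of update rounds.

    items:   {id: (x, y)} movable data items
    clients: {id: (x, y)} fixed client locations
    edges:   [(u, v, comm)] communication between two ids
    """
    pos = dict(items)
    for _ in range(iterations):
        for u, v, comm in edges:
            for a, b in ((u, v), (v, u)):
                if a in pos:  # only data items move; clients stay fixed
                    other = pos[b] if b in pos else clients[b]
                    pos[a] = spring_update(pos[a], other, comm, k)
    return pos
```

Because clients never move, they act as anchors: a data item that talks only to one client drifts toward that client, while items that talk to each other are pulled together.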

Phase 3: Iteratively Collapse Data to Datacenters. 
After computing a nearly ideal placement of the data
items on the surface of the earth, we have to modify this
placement so that the data items are located in datacen- 
ters, and the set of items in each datacenter satisfies its 
capacity constraints. Like Phase 2, this is done itera- 
tively: initially, every data item is mapped to its clos- 
est datacenter. For datacenters that are over their capac- 
ity, Volley identifies the items that experience the fewest 
accesses, and moves all of them to the next closest data- 
center. Because this may still exceed the total capacity of 
some datacenter due to new additions, Volley repeats the 
process until no datacenter is over capacity. Assuming 
that the system has enough capacity to successfully host 
all items, this algorithm always terminates in at most as 
many iterations as there are datacenters in the system. 
For each data item that has moved, Volley outputs 
a migration proposal containing the new datacenter lo- 
cation, the new values for latency and ongoing inter- 
datacenter bandwidth, and the one-time bandwidth re- 
quired for this migration. This is a straightforward cal- 
culation using the old data locations, the new data lo- 
cations, and the inputs supplied by the datacenter ser- 
vice administrator, such as the cost model and the latency 
model. These migration proposals are then consumed by 
application-specific migration mechanisms. 
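
The collapse step can be sketched as below. This is a simplified illustration under stated assumptions: capacity is counted in number of items, each item carries a precomputed list of datacenters ordered from closest to farthest, and only the excess items of an overloaded datacenter are moved. The function name and the data layout are hypothetical.

```python
def collapse_to_datacenters(items, capacity, max_rounds=None):
    """Assign items to datacenters while respecting per-datacenter capacity.

    items:    {item_id: {"accesses": int, "ranked_dcs": [dc, ...]}} with
              datacenters listed closest first (assumed to be precomputed)
    capacity: {dc: maximum number of items}
    Returns {item_id: dc}.
    """
    choice = {i: 0 for i in items}  # index into ranked_dcs; 0 = closest
    rounds = max_rounds if max_rounds is not None else len(capacity)
    for _ in range(rounds):
        hosted = {dc: [] for dc in capacity}
        for i, idx in choice.items():
            hosted[items[i]["ranked_dcs"][idx]].append(i)
        overloaded = {dc: ids for dc, ids in hosted.items()
                      if len(ids) > capacity[dc]}
        if not overloaded:
            break
        for dc, ids in overloaded.items():
            ids.sort(key=lambda i: items[i]["accesses"])  # least accessed first
            for i in ids[: len(ids) - capacity[dc]]:      # move only the excess
                if choice[i] + 1 < len(items[i]["ranked_dcs"]):
                    choice[i] += 1                        # next closest datacenter
    return {i: items[i]["ranked_dcs"][idx] for i, idx in choice.items()}
```

Moving the least-accessed items first keeps the most frequently used data at its preferred location, so the capacity fix-up costs as little latency as possible.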


3.4 Application-specific Migration 


Volley is designed to be usable by many different 
cloud services. For Volley to compute a recommended 
placement, the only requirement it imposes on the cloud
service is that it logs the request data described in Ta-
ble 1. Given these request logs as input, Volley outputs a 
set of migration proposals described in Table 2, and then 
leaves the actual migration of the data to the cloud ser- 
vice itself. If the cloud service also provides the initial 
location of data items, then each migration proposal will 
include the bandwidth required to migrate, and the ex- 
pected change in latency and inter-datacenter bandwidth 
after migration. 

Volley’s decision to leave migration to application- 
specific migration mechanisms allows Volley to be more 
easily applied to a diverse set of datacenter applications. 
For example, some datacenter applications use migra- 
tion mechanisms that follow the pattern of marking data 
read-only in the storage system at one location, copy- 
ing the data to a new location, updating an application- 
specific name service to point to the new copy, marking 
the new copy as writeable, and then deleting the old copy. 
Other datacenter applications maintain multiple replicas 
in different datacenters, and migration may simply require
designating a different replica as the primary. Independent
of the migration mechanism, datacenter applications
might desire to employ application-specific throttling
policies, such as only migrating user state when an
application-specific predictive model suggests the user is 
unlikely to access their state in the next hour. Because 
Volley does not attempt to migrate the data itself, it does 
not interfere with these techniques or any other migration 
technique that an application may wish to employ. 
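
To make the first migration pattern above concrete, here is a toy sketch of its five steps. Everything in it is a hypothetical in-memory stand-in: `stores`, `name_service` and `read_only` are plain Python containers, not the interface of any real storage system or name service.

```python
def migrate(item_id, src, dst, stores, name_service, read_only):
    """Move one item from datacenter src to dst.

    stores:       {dc: {item_id: data}}
    name_service: {item_id: dc}
    read_only:    set of (dc, item_id) pairs that currently reject writes
    """
    read_only.add((src, item_id))                # 1. freeze writes at the old location
    stores[dst][item_id] = stores[src][item_id]  # 2. copy the data to the new location
    read_only.add((dst, item_id))                #    the new copy starts out read-only
    name_service[item_id] = dst                  # 3. point the name service at the new copy
    read_only.discard((dst, item_id))            # 4. mark the new copy writeable
    del stores[src][item_id]                     # 5. delete the old copy
    read_only.discard((src, item_id))
```

A throttling policy of the kind mentioned above would sit in front of this function and decide when, or whether, to call it for a given migration proposal.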


4 Evaluation 


In our evaluation, we compare Volley to three heuris- 
tics for where to place data and show that Volley sub- 
stantially outperforms all of them on the metrics of dat- 
acenter capacity skew, inter-datacenter traffic, and user- 
perceived latency. We focus exclusively on the month- 
long Live Mesh trace for conciseness. For both the 
heuristics and Volley, we first compute a data placement 
using a week of data from the Live Mesh trace, and then 
evaluate the quality of the resulting placement on the 
following three weeks of data. For all four placement 
methodologies, any data that appears in the three-week 
evaluation window but not in the one-week placement 
computation window is placed in a single datacenter lo- 
cated in the United States (in production, this new data 
will be handled the next time the placement methodol- 
ogy is run). Placing all previously unseen data in one 
datacenter penalizes the different methodologies equally 
for such data. 

The first heuristic we consider is commonIP: place
data as close as possible to the IP address that most commonly
accesses it. The second heuristic is oneDC: put
all data in one datacenter, a strategy still taken by many
companies due to its simplicity. The third heuristic is
hash: hash data to datacenters so as to optimize for
load-balancing. These three heuristics represent reason- 
able approaches to optimizing for the three different met- 
rics we consider—oneDC and hash optimize for inter- 
datacenter traffic and datacenter capacity skew respec- 
tively, while commonIP is a reasonably sophisticated 
proposal for optimizing latency. 
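
The three baseline heuristics are simple enough to sketch in a few lines. The helper `closest_dc` (a function from an IP address to its nearest datacenter), the choice of SHA-256, and the function names are assumptions made for this illustration; the text does not say how the evaluation implemented them.

```python
import hashlib
from collections import Counter


def common_ip(access_ips, closest_dc):
    """commonIP: place the item near the IP address that accesses it most often."""
    top_ip, _ = Counter(access_ips).most_common(1)[0]
    return closest_dc(top_ip)


def one_dc(datacenters):
    """oneDC: put everything in a single datacenter."""
    return datacenters[0]


def hash_placement(item_id, datacenters):
    """hash: spread items across datacenters by hashing the item id."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return datacenters[int.from_bytes(digest[:8], "big") % len(datacenters)]
```

Each function looks at a single item in isolation, which is why none of these heuristics can account for communication between data items the way Volley does.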

Throughout our evaluation, we use 12 commercial 
datacenters as potential locations. These datacenters are 
distributed across multiple continents, but their exact lo- 
cations are confidential. Confidentiality concerns also 
prevent us from revealing the exact amount of bandwidth 
consumed by our services. Thus, we present the inter- 
datacenter traffic from different placements using the 
metric “fraction of messages that are inter-datacenter.” 
This allows an apples-to-apples comparison between the 
different heuristics and Volley without revealing the un- 
derlying bandwidth consumption. The bandwidth con- 
sumption from centralizing Volley logs, needed for Vol- 
ley and commonIP, is so small compared to this inter- 
datacenter traffic that it does not affect graphs compar- 
ing this metric among the heuristics. We configure Vol- 
ley with a datacenter capacity model such that no one of 
the 12 datacenters can host more than 10% of all data, a 
reasonably balanced use of capacity. 


All latencies that we compute analytically use the la- 
tency model described in Section 3. This requires using 
the client’s IP address in the trace to place them at a ge- 
ographic location. In this Live Mesh application, client 
requests require sending a message to a first data item, 
which then sends a second message to a second data 
item; the second data item sends a reply, and then the 
first data item sends the client its reply. If the data items 
are in the same datacenter, latency is simply the round 
trip time between the client and the datacenter. If the 
data items are in separate datacenters, latency is the sum 
of four one-way delays: client to datacenter 1, datacen- 
ter 1 to datacenter 2, datacenter 2 back to datacenter 1, 
and datacenter 1 back to the client. These latency calculations
leave out other potential protocol overheads, such
as the need to initially establish a TCP connection or to 
authenticate; any such protocol overheads encountered 
in practice would magnify the importance of latency im- 
provements by incurring the latency multiple times. For 
clarity of presentation, we consistently group latencies 
into 10 millisecond bins in our graphs. The graphs only 
present latency up to 250 milliseconds because the better 
placement methodologies all achieve latency well under 
this for almost all requests. 
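
The per-request latency rule described above can be written down directly. In this sketch, `one_way_ms(a, b)` is an assumed helper that returns the modeled one-way delay between two locations in milliseconds; the function names and the binning helper are illustrative only.

```python
def request_latency_ms(one_way_ms, client, dc_first, dc_second):
    """Latency of a request that touches two data items.

    dc_first hosts the item the client contacts; dc_second hosts the item
    that the first item contacts in turn.
    """
    if dc_first == dc_second:
        # Same datacenter: just the client <-> datacenter round trip.
        return one_way_ms(client, dc_first) + one_way_ms(dc_first, client)
    # Different datacenters: four one-way hops.
    return (one_way_ms(client, dc_first)
            + one_way_ms(dc_first, dc_second)
            + one_way_ms(dc_second, dc_first)
            + one_way_ms(dc_first, client))


def latency_bin_ms(latency_ms, width_ms=10):
    """Lower edge of the bin a latency falls into (10 ms bins by default)."""
    return int(latency_ms // width_ms) * width_ms
```

The extra two hops in the cross-datacenter case are what make co-locating communicating data items valuable for latency as well as for bandwidth.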

Our evaluation begins by comparing Volley and the 
three heuristics on the metrics of datacenter capacity 
skew and inter-datacenter traffic (Section 4.1). Next, we 
evaluate the impact of these placements on the latency of 
client requests, including evaluating Volley in the context
of a simple, hypothetical example to understand this
impact in detail (Section 4.2). We then evaluate the incremental
benefit of Volley as a function of the number
of Volley iterations (Section 4.3). Next, we evaluate
the resource demands of running Volley on the SCOPE
distributed execution infrastructure (Section 4.4). Finally,
we evaluate the impact of running Volley more frequently
or less frequently (Section 4.5).

Figure 12. Datacenter capacity required by three different
placement heuristics and Volley.

Figure 13. Inter-datacenter traffic under three different
placement heuristics and Volley.


4.1 Impact on Datacenter Capacity Skew and 
Inter-datacenter Traffic 


We now compare Volley to the three heuristics for 
where to place data and show that Volley substantially 
outperforms all of them on the metrics of datacenter 
capacity skew and inter-datacenter traffic. Figures 12 
and 13 show the results: hash has perfectly balanced 
use of capacity, but high inter-datacenter traffic; oneDC 
has zero inter-datacenter traffic (the ideal), but extremely 
unbalanced use of capacity; and commonIP has a mod- 
est amount of inter-datacenter traffic, and capacity skew 
where 1 datacenter has to support more than twice the
load of the average datacenter. Volley is able to meet a 
reasonably balanced use of capacity while keeping inter- 
datacenter traffic at a very small fraction of the total num- 
ber of messages. In particular, compared to commonIP, 
Volley reduces datacenter skew by over 2x and reduces 
inter-datacenter traffic by over 1.8x.

Figure 14. Client request latency under three different
placement heuristics and Volley.


4.2 Impact on Latency of Client Requests 


We now compare Volley to the three heuristics on the 
metric of user-perceived latency. Figure 14 shows the 
results: hash has high latency; oneDC has mediocre la- 
tency; and commonIP has the best latency among the 
three heuristics. Although commonIP performs better 
than oneDC and hash, Volley performs better still, par- 
ticularly on the tail of users that experience high latency 
even under the commonIP placement strategy. Com- 
pared to commonIP, Volley reduces 75th percentile la- 
tency by over 30%. 


4.2.1 


Previously, we evaluated the impact of placement on 
user-perceived latency analytically using the latency model
described in Section 3. In this section, we evaluate Vol- 
ley’s latency impact on a live system using a prototype 
cloud service. We use the prototype cloud service to em- 
ulate Live Mesh for the purpose of replaying a subset of 
the Live Mesh trace. We deployed the prototype cloud 
service across 20 virtual machines spread across the 12 
geographically distributed datacenters, and we used one 
node at each of 109 Planetlab sites to act as clients of the 
system. 

The prototype cloud service consists of four compo- 
nents: the frontend, the document service, the publish- 
subscribe service, and the message queue service. Each 
of these components runs on every VM so as to have every
service running in every datacenter. These components
of our prototype map directly to the actual Live
Mesh component services that run in production. The
ways in which the production component services cooperate
to provide features in the Live Mesh service are
described in detail elsewhere [1], and we provide only a
brief overview here. 

The prototype cloud service exposes a simple fron- 
tend that accepts client requests and routes them to the 
appropriate component in either its own or another data- 
center. In this way, each client can connect directly to 
any datacenter, and requests that require an additional 
step (e.g., updating an item, and then sending the update 
to others) will be forwarded appropriately. This design
allows clients to cache the location of the best datacenter
to connect to for any given operation, but requests still
succeed if a client request arrives at the wrong datacenter
due to cache staleness.

Figure 15. Comparing Volley to the commonIP heuristic
on a live system spanning 12 geographically distributed
datacenters and accessed by Planetlab clients. In this
figure, we use a random sample of the Live Mesh trace.
We see that Volley provides moderately better latency
than the commonIP heuristic.


We walk through an example of how two clients can 
rendezvous by using the document, publish-subscribe, 
and message queue services. The document service can 
store arbitrary data; in this case, the first client can store 
its current IP address, and a second client can then read 
that IP address from the document service and contact 
the first client directly. The publish-subscribe service is 
used to send out messages when data in the document 
service changes; for example, if the second client sub- 
scribes to updates for the first client’s IP address, these 
updates will be pro-actively sent to the second client, in- 
stead of the second client having to poll the document 
service to see if there have been any changes. Finally, the 
message queue service buffers messages for clients from 
the publish-subscribe service. If the client goes offline 
and then reconnects, it can connect to the queue service 
and dequeue these messages. 


To evaluate both Volley and the commonIP heuristic’s 
latency on this live system, we used the same data place- 
ments computed on the first week of the Live Mesh trace. 
Because the actual Live Mesh service requires more than 
20 VMs, we had to randomly sample requests from the 
trace before replaying it. We also mapped each client 
IP in the trace subset to the closest Planetlab node, and 
replayed the client requests from these nodes. 


Figure 15 shows the measured latency on the sample 
of the Live Mesh trace; recall that we are grouping laten- 
cies into 10 millisecond bins for clarity of presentation. 
We see that Volley consistently provides better latency 
than the commonIP placement. These latency benefits 
are visible despite a relatively large number of external 
sources of noise, such as the difference between the actual
client locations and the Planetlab locations, differences
between typical client connectivity (that Volley’s
latency model relies on) and Planetlab connectivity, and
occasional high load on the Planetlab nodes leading to
high slice scheduling delays.

Table 3. Hypothetical application logs. In this example,
IP^1 is located at geographic coordinates (10,110) and
IP^2 at (10,10).

             commonIP   Volley Phase 1   Volley Phase 2
    PSS^a    (10,110)   (14.7,43.1)      (15.1,49.2)
    PSS^b    (10,10)    (15.3,65.6)      (13.3,63.6)
    Q^1      (10,10)    (14.7,43.1)      (15.1,50.5)
    Q^2      (10,110)   (10,110)         (13.6,88.4)

Table 4. CommonIP and Volley placements computed
using Table 3, assuming a datacenter at every point
on Earth and ignoring capacity constraints and inter-
datacenter traffic.

    Columns: Transaction-Id; commonIP distance, latency; Volley Phase 2 distance, latency
    commonIP distances:        27,070 miles; 0 miles; 0 miles; 13,535 miles
    Volley Phase 2 distances:  8,202 miles; 7,246 miles; 7,246 miles; 6,289 miles

Table 5. Distances traversed and latencies of user requests
in Table 3 using commonIP and Volley Phase 2
placements in Table 4. Note that our latency model [36]
includes an empirically-determined access penalty for
all communication involving a client.

Other than due to sampling of the request trace, the 
live experiment has no impact on the datacenter capac- 
ity skew and inter-datacenter traffic differences between 
the two placement methodologies. Thus, Volley offers 
an improvement over commonIP on every metric simul- 
taneously, with the biggest benefits coming in reduced 
inter-datacenter traffic and reduced datacenter capacity 
skew. 


4.2.2 Detailed Examination of Latency Impact 


To examine in detail how placement decisions im- 
pact latencies experienced by user requests, we now con- 
sider a simple example. Table 3 lists four hypothetical 
Live Mesh transactions involving four data objects and 
clients behind two IP addresses. For the purposes of this 
simple example, we assume there is a datacenter at every
point on Earth with infinite capacity and no inter-
datacenter traffic costs. We pick the geographic coordinates
of (10,110) and (10,10) for ease of examining
how far each object’s placement is from the client IP
addresses. Table 4 shows the placements calculated by
commonIP and Volley in Phases 1 and 2. In Phase 1,
Volley calculates the weighted spherical mean of the geographic
coordinates for the client IPs that access each
“Q” object. Hence Q^1 is placed roughly two-thirds along
the great-circle segment from IP^1 to IP^2, while Q^2 is
placed at IP^1. Phase 1 similarly calculates the placement
of each “PSS” object using these coordinates for
“Q” objects. Phase 2 then iteratively refines these coordinates.

Figure 16. Average distance traveled by each object
during successive Volley Phase 2 iterations. The average
incorporates some objects traveling quite far, while
many travel very little.

We now consider the latency impact of these place- 
ments on the same set of user requests in Table 3. Ta- 
ble 5 shows for each user request, the physical distance 
traversed and the corresponding latency (round trip from 
PSS to Q and from Q to IP). CommonIP optimizes for 
client locations that are most frequently used, thereby 
driving down latency to the minimum for some user re- 
quests but at a significant expense to others. Volley con- 
siders all client locations when calculating placements, 
and in doing so drives down the worst cases by more than 
the amount it drives up the common case, leading to an 
overall better latency distribution. Note that in practice, 
user requests change over time after placement decisions 
have been made and our trace-based evaluation does use 
later sets of user requests to evaluate placements based 
on earlier requests. 


4.3 Impact of Volley Iteration Count


We now show that Volley converges after a small 
number of iterations; this will allow us to establish in 
Section 4.4 that Volley runs quickly (i.e., less than a day),
and thus can be re-run frequently. Figures 16, 17, 18 
and 19 show the performance of Volley as the number 
of iterations varies. Figure 16 shows that the distance 
that Volley moves data significantly decreases with each 
Volley iteration, showing that Volley relatively quickly 
converges to its ideal placement of data items. 

Figure 17. Inter-datacenter traffic at each Volley iteration.

Figure 18. Datacenter capacity at each Volley iteration.

Figure 19. Client request latency at each Volley iteration.

Figures 17, 18 and 19 further break down the changes
in Volley’s performance in each iteration. Figure 17
shows that inter-datacenter traffic is reasonably good after
the initial placement of Phase 1, and is quite similar to
the commonIP heuristic. In contrast, recall that the hash 
heuristic led to almost 80% of messages crossing data- 
center boundaries. Inter-datacenter traffic then decreases 
by over a factor of 2 during the first 5 Phase 2 itera- 
tions, decreases by a small amount more during the next 
5 Phase 2 iterations, and finally goes back up slightly 
when Volley’s Phase 3 balances the items across data- 
centers. Of course, the point of re-balancing is to avoid 
the kind of capacity skew seen in the commonIP heuris- 
tic, and in this regard a small increase in inter-datacenter 
traffic is acceptable. 


Turning now to datacenter capacity, we see that Vol- 
ley’s placement is quite skewed (and by an approxi- 
mately constant amount) until Phase 3, where it smooths 
out datacenter load according to its configured capacity
model (i.e., such that no one of the 12 datacenters hosts
more than 10% of the data). Turning finally to latency,
Figure 19 shows that latency has reached its minimum
after only five Phase 2 iterations. In contrast to the impact
on inter-datacenter traffic, there is almost no latency
penalty from Phase 3’s data movement to satisfy datacenter
capacity.

    Volley Phase | Elapsed Time in Hours | SCOPE Stages | SCOPE Vertices | CPU Hours
    1:22 20,668 | 8
    1413 255,228

Table 6. Volley’s demands on the SCOPE infrastructure
to analyze 1 week’s worth of traces.


4.4 Volley Resource Demands 


Having established that Volley converges after a small 
number of iterations, we now analyze the resource re- 
quirements for this many iterations; this will allow us to 
conclude that Volley completes quickly and can be re-run 
frequently. The SCOPE cluster we use consists of well 
over 1,000 servers. Table 6 shows Volley’s demands on 
the SCOPE infrastructure broken down by Volley’s dif- 
ferent phases. The elapsed time, SCOPE stages, SCOPE 
vertices and CPU hours are cumulative over each phase:
Phase 1 has only one iteration to compute the initial
placement, while Phase 2 has ten iterations to improve 
the placement, and Phase 3 has 12 iterations to balance 
out usage over the 12 datacenters. Each SCOPE stage 
in Table 6 corresponds approximately to a single map 
or reduce step in MapReduce [11]. There are 680 such 
stages overall, leading to lots of data shuffling; this is 
one reason why the total elapsed time is not simply CPU 
hours divided by the degree of possible parallelism. Ev- 
ery SCOPE vertex in Table 6 corresponds to a node in 
the computation graph that can be run on a single ma- 
chine, and thus dividing the total number of vertices by 
the total number of stages yields the average degree of 
parallelism within Volley: the average stage parallelizes 
out to just over 406 machines (some run on substantially 
more). The SCOPE cluster is not dedicated to Volley but
rather is a multi-purpose cluster used for several tasks. 
The operational cost of using the cluster for 16 hours ev- 
ery week is small compared to the operational savings 
in bandwidth consumption due to improved data place- 
ment. The data analyzed by Volley is measured in the 
terabytes. We cannot reveal the exact amount because it 
could be used to infer confidential request volumes since 
every Volley log record is 100 bytes. 


4.5 Impact of Rapid Volley Re-Computation 


Figure 20. Client request latency with stale Volley placements.

Figure 21. Inter-datacenter traffic with stale Volley placements.

Figure 22. Previously unseen objects over time.

Having established that Volley can be re-run fre-
quently, we now show that Volley provides substantially
better performance by being re-run frequently. For these
experiments, we use traces from the Live Mesh service 
extending from the beginning of June 2009 all the way 
to the beginning of September 2009. Figures 20, 21 
and 22 show the impact of rapidly re-computing place- 
ments: Volley computes a data placement using the trace 
from the first week of June, and we evaluate the per- 
formance of this placement on a trace from the imme- 
diately following week, the week after the immediately 
following week, a week starting a month later, and a 
week starting three months after Volley computed the 
data placement. The better performance of the place- 
ment on the immediately following week demonstrates 
the significant benefits of running Volley frequently with 
respect to both latency and inter-datacenter traffic. Fig- 
ure 20 shows that running Volley even every two weeks 
is noticeably worse than having just run Volley, and this 
latency penalty keeps increasing as the Volley placement
becomes increasingly stale. Figure 21 shows a similar
progressively increasing penalty to inter-datacenter traf-
fic; running Volley frequently results in significant inter-
datacenter traffic savings.

Figure 23. Fraction of objects moved compared to first week.

Figure 22 provides some insight into why running 
Volley frequently is so helpful; the number of previously 
unseen objects increases rapidly with time. When run 
frequently, Volley detects accesses to an object sooner. 
Note that this inability to intelligently place previously 
unseen objects is shared by the commonIP heuristic, and 
so we do not separately evaluate the rate at which it de- 
grades in performance. 

In addition to new objects that are created and ac- 
cessed, previously placed objects may experience sig- 
nificantly different access patterns over time. Running 
Volley periodically provides the added benefit of migrat- 
ing these objects to locations that can better serve new 
access patterns. Figure 23 compares a Volley placement 
calculated from the first week of June to a placement cal- 
culated in the second week, then the first week to the 
third week, and finally the first week to the fourth week. 
About 10% of the objects in any week undergo migra- 
tions, either as a direct result of access pattern changes 
or due to more important objects displacing others in 
capacity-limited datacenters. The majority of objects re-
tain their placement compared to the first week.

Running Volley periodically has a third, but minor, advan-
tage. Some client requests come from IP addresses that
are not present in geo-location databases. Objects that 
are accessed solely from such locations are not placed by 
Volley. If additional traces include accesses from other 
IP addresses that are present in geo-location databases, 
Volley can then place these objects based on these new 
accesses. 


5 Related Work 


The problem of automatic placement of application 
data re-surfaces with every new distributed computing 
environment, such as local area networks (LANs), mo- 
bile computing, sensor networks, and single cluster web 
sites. In characterizing related work, we first focus on the 
mechanisms and policies that were developed for these 
other distributed computing environments. We then de-
scribe prior work that focused on placing static content
on CDNs; compared to this prior work, Volley is the first 
research system to address placement of dynamic appli- 
cation data across geographically distributed datacenters. 
We finally describe prior work on more theoretical ap- 
proaches to determining an optimal data placement. 


5.1 Placement Mechanisms 


Systems such as Emerald [20], SOS [37], Globe [38], 
and Legion [25] focused on providing location-
independent programming abstractions and migration
mechanisms for moving data and computation between 
locations. Systems such as J-Orchestra [42] and Addis- 
tant [41] have examined distributed execution of Java ap- 
plications through rewriting of byte code, but have left 
placement policy decisions to the user or developer. In 
contrast, Volley focuses on placement policy, not mech- 
anism. Some prior work incorporated both placement 
mechanism and policy, e.g., Coign [18], and we charac- 
terize its differences with Volley’s placement policy in 
the next subsection. 


5.2 Placement Policies for Other Distributed 
Computing Environments 


Prior work on automatic data placement can be 
broadly grouped by the distributed computing environ- 
ment that it targeted. Placing data in a LAN was tackled 
by systems such as Coign [18], IDAP [22], ICOPS [31], 
CAGES [17], Abacus [3] and the system of Stewart et 
al [39]. Systems such as Spectra [13], Slingshot [40], 
MagnetOS [27], Pleiades [23] and Wishbone [33] ex-
plored data placement in a wireless context, either be-
tween mobile clients and more powerful servers, or in ad
hoc and sensor networks. Hilda [44] and Doloto [30] 
explored splitting data between web clients and web 
servers, but neither assumed there were multiple geo- 
graphic locations that could host the web server. 


Volley differs from these prior systems in several 
ways. First, the scale of the data that Volley must process 
is significantly greater. This required designing the Vol- 
ley algorithm to work in a scalable data analysis frame- 
work such as SCOPE [5] or MapReduce [11]. Second, 
Volley must place data across a large number of datacen- 
ters with widely varying latencies both between datacen- 
ters and clients, and between the datacenters themselves; 
this aspect of the problem is not addressed by the algo- 
rithms in prior work. Third, Volley must continuously 
update its measurements of the client workload, while 
some (though not all) of these prior approaches used an 
upfront profiling approach. 


5.3 Placement Policies for Static Data


Data placement for Content Delivery Networks 
(CDNs) has been explored in many pieces of prior 
work [21, 19]. These systems have focused on static data 
— the HTTP caching header should be honored, but no 
other more elaborate synchronization between replicas is 
needed. Because of this, CDNs can easily employ decen-
tralized algorithms, e.g., each individual server or a small
set of servers can independently make decisions about 
what data to cache. In contrast, Volley’s need to deal 
with dynamic data would make a decentralized approach 
challenging; Volley instead opts to collect request data in 
a single datacenter and leverage the SCOPE distributed 
execution framework to analyze the request logs within 
this single datacenter. 


5.4 Optimization Algorithms 


Abstractly, Volley seeks to map objects to locations
so as to minimize a cost function. Although there are no 
known approximation algorithms for this general prob- 
lem, the theory community has developed approximation 
algorithms for numerous more specialized settings, such 
as sparsest cut [4] and various flavors of facility loca- 
tion [6, 8]. To the best of our knowledge, the problem 
in Volley does not map to any of these previously stud- 
ied specializations. For example, the problem in Volley 
differs from facility location in that there is a cost asso- 
ciated with placing two objects at different datacenters, 
not just costs between clients and objects. This moti- 
vates Volley’s choice to use a heuristic approach and to 
experimentally validate the quality of the resulting data 
placement. 

Although Volley offers a significant improvement 
over a state-of-the-art heuristic, we do not yet know how
close it comes to an optimal placement; determining such 
an optimal placement is challenging because standard 
commercial optimization packages simply do not scale 
to the data sizes of large cloud services. This leaves open 
the tantalizing possibility that further improvements are 
possible beyond Volley. 


6 Conclusion 


Cloud services continue to grow to span large num- 
bers of datacenters, making it increasingly urgent to de- 
velop automated techniques to place application data 
across these datacenters. Based on the analysis of month- 
long traces from two large-scale commercial cloud ser- 
vices, Microsoft’s Live Messenger and Live Mesh, we 
built the Volley system to perform automatic data place- 
ment across geographically distributed datacenters. To 
scale to the large data volumes of cloud service logs, Vol- 
ley is designed to work in the SCOPE [5] scalable data 
analysis framework. 


We evaluate Volley analytically and on a live system 
consisting of a prototype cloud service running on a ge- 
ographically distributed testbed of 12 datacenters. Our 
evaluation using one of the month-long traces shows that, 
compared to a state-of-the-art heuristic, Volley simulta-
neously reduces datacenter capacity skew by over 2x, re-
duces inter-datacenter traffic by over 1.8x, and reduces
75th percentile latency by over 30%. This shows the po- 
tential of Volley to simultaneously improve the user ex- 
perience and significantly reduce datacenter costs. 

While in this paper we have focused on using Volley 
to optimize data placement in existing datacenters, ser- 
vice operators could also use Volley to explore future 
sites for datacenters that would improve performance. 
By including candidate locations for datacenters in Vol- 
ley’s input, the operator can identify which combination 
of additional sites improves latency at modest costs in
greater inter-datacenter traffic. We hope to explore this 
more in future work. 


Abstract: We present a method to jointly optimize
the cost and the performance of delivering traffic from 
an online service provider (OSP) network to its users. 
Our method, called Entact, is based on two key tech- 
niques. First, it uses a novel route-injection mechanism 
to measure the performance of alternative paths that are 
not being currently used, without disturbing current traf- 
fic. Second, based on the cost, performance, traffic, and 
link capacity information, it computes the optimal cost 
vs. performance curve for the OSP. Each point on the 
curve represents a potential operating point for the OSP 
such that no other operating point offers a simultaneous 
improvement in cost and performance. The OSP can 
then pick the operating point that represents the desired 
trade-off (e.g., the “sweet spot”). We evaluate the benefit
and overhead of Entact using trace-driven evaluation in 
a large OSP with 11 geographically distributed data cen- 
ters. We find that by using Entact this OSP can reduce its 
traffic cost by 40% without any increase in path latency 
and with acceptably low overheads. 


1 Introduction 


Providers of online services such as search, maps, and 
instant messaging are experiencing an enormous growth 
in demand. Google attracts over 5 billion search queries 
per month [2], and Microsoft’s Live Messenger attracts 
over 330 million active users each month [5]. To satisfy 
this global demand, online service providers (OSPs) op- 
erate a network of geographically dispersed data centers 
and connect with many Internet service providers (ISPs). 
Different users interact with different data centers, and 
ISPs help the OSPs carry traffic to and from the users. 


Two key considerations for OSPs are the cost and the 
performance of delivering traffic to their users. Large OSPs
such as Google, Microsoft, and Yahoo! send and receive 
traffic that exceeds a petabyte per day. Accordingly, they 
bear huge costs to transport data. 


While cost is clearly of concern, performance of traf-
fic is critical as well because revenue relies directly on it.
Even small increments in user-experienced delay (e.g., 
page load time) can lead to significant loss in revenue 
through a reduction in purchases, search queries, or ad- 
vertisement click-through rates [20]. Because applica- 
tion protocols involve multiple round trips, small incre- 
ments in path latency can lead to large increments in 
user-experienced delay. 


The richness of OSP networks makes it difficult to op- 
timize the cost and performance of traffic. There are nu- 
merous destination prefixes and numerous choices for 
mapping users to data centers and for selecting ISPs. 
Each choice has different cost and performance
characteristics. For instance, while some ISPs are free, 
some are exorbitantly expensive. Making matters worse, 
cost and performance must be optimized jointly because 
the trade-off between the two factors can be complex. We 
show that optimizing for cost alone leads to severe per- 
formance degradation and optimizing for performance 
alone leads to significant cost. 


To our knowledge, no automatic traffic engineering 
(TE) methods exist today for OSP networks. TE for 
OSPs requires a different formulation than that for tran- 
sit ISPs or multihomed stub networks. In the traditional 
intra-domain TE for transit ISPs, the goal is to balance 
load across multiple internal paths [13, 18, 23]. End-to-
end user performance is not considered.


Unlike multihomed stub networks, OSPs can source 
traffic from any of their multiple data centers. This 
flexibility adds a completely new dimension to the op- 
timization. Further, large OSPs connect to hundreds of 
ISPs — two orders of magnitude more than multihomed 
stub networks — which calls for highly scalable solu- 
tions. Another assumption in TE schemes for multi- 
homed sites [7, 8, 15] is that each connected ISP offers 
paths to all Internet destinations. This assumption is not 
valid in the OSP context. 


NSDI '10: 7th USENIX Symposium on Networked Systems Design and Implementation


Given the limitations of the current TE methods, the 
state of the art for optimizing traffic in OSP networks 
is rather rudimentary. Operators manually configure a 
delicate balance between cost and performance. Because 
of the complexity of large OSP networks, the operating 
point thus achieved can be far from desirable. 


We present the design and evaluation of Entact, the 
first TE scheme for OSP networks. We identify and ad- 
dress two primary challenges in realizing such a scheme. 
First, because the interdomain routing protocol (BGP) 
does not include performance information, performance 
is unknown for paths that can be used but are not be- 
ing currently used. We must estimate the performance of 
such paths without actually redirecting traffic to them as 
redirection can be disruptive. We overcome this chal-
lenge via a novel route injection technique. To mea-
sure an unused path for a prefix, Entact selects an IP ad-
dress ip within the prefix and installs a route for ip/32
to routers in the OSP network. Because of the longest-
prefix match rule, packets destined to ip will follow the
installed route while the rest of the traffic will continue 
to use the current route. 


The second challenge is to use the cost, performance, 
traffic volume, and link capacity information to find in 
real time a TE strategy that matches the OSP’s goals. 
Previous algorithmic studies of route selection optimize 
one of the two metrics, performance or cost, with the 
other as the fixed constraint. However, from conver- 
sations with the operators of a large OSP, we learned 
that often there is no obvious answer for which met- 
ric should be selected as the fixed constraint, as profit 
depends on the complex trade-off between performance 
and cost. Entact uses a novel joint optimization tech- 
nique that finds the entire trade-off curve and lets the op- 
erator pick a desirable point on that curve. Such a tech- 
nique provides operators with useful insight and a range 
of options for configuring the network as desired. 


We demonstrate the benefits of Entact in Microsoft’s 
global network (MSN), one of the largest OSPs today. 
Because we are not allowed to arbitrarily change the 
paths used by various prefixes, we conduct a trace-driven 
study. We implement the key components of Entact and 
measure the relevant routing, traffic, and performance in- 
formation. We use this information to simulate Entact- 
based TE in MSN. We find that compared to the com- 
mon (manual) practices today, Entact can reduce the total 
traffic cost by up to 40% without compromising perfor- 
mance. We also find that these benefits can be realized 
with low overhead. Exploring two closest data centers 
for each destination prefix and one non-default route at 
each data center tends to be enough, and changing routes 
once per hour tends to be enough. 


Figure 1: Typical network architecture of a large OSP.


2 Traffic Cost and Performance for OSPs 


In this section, we describe the architecture of a typical 
OSP network. We also outline the unique cost and per- 
formance optimization opportunities that arise in OSP 
networks by exploiting the presence of a diverse set of 
alternative paths for transporting service traffic. 


2.1 OSP network architecture 


Figure 1 illustrates the typical network architecture of 
large OSPs. To satisfy global user demand, such OSPs 
have data centers (DCs) in multiple geographical loca- 
tions. Each DC hosts a large number of servers, any- 
where from several hundreds to hundreds of thousands. 
For cost, performance, and robustness, each DC is con- 
nected to many ISPs that are responsible for carrying 
traffic between the OSP and its millions of users. Large 
OSPs such as Google and Microsoft often also have their 
own backbone network to interconnect the DCs. 


2.2 Cost of carrying traffic 


The traffic of an OSP traverses both internal links that 
connect the DCs and external links that connect to neigh- 
boring ISPs. The cost model is different for the two types 
of links. The internal links are either dedicated or leased. 
Their cost is incurred during acquisition, and any recur- 
ring cost is independent of the traffic volume that they 
carry. Hence, we can ignore this cost when engineering 
an OSP’s traffic. 


The cost of an external link is a function of traffic vol-
ume, i.e., F(v), where F is a non-decreasing cost func-
tion and v is the charging volume of the traffic. The cost
function F is commonly of the form price × v, where
price is the unit traffic volume price of a link. The charg-
ing volume v is based on actual traffic volume. A com-
mon practice is to use the 95th-percentile (P95). Under
this scheme, the traffic volume on the link is sampled for
every 5-minute interval. At the end of a billing period, 
e.g., a month, the charging volume is the 95th percentile 
across all the samples. Thus, the largest 5% of the in- 
tervals are not considered, which protects an OSP from 
being charged for short bursts of traffic. 
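
The P95 charging scheme described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the nearest-rank percentile definition is an assumption (billing contracts may define the percentile slightly differently), and the linear price × v cost form is the common case mentioned in the text.

```python
def p95_charging_volume(samples):
    """Return the 95th-percentile sample using the nearest-rank method.

    `samples` holds one traffic-volume measurement per 5-minute interval.
    The nearest-rank definition is an assumption made for this sketch.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n), computed with integers to avoid floating-point rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def link_cost(samples, price):
    """Cost of a link under the common linear form F(v) = price * v."""
    return price * p95_charging_volume(samples)


# 100 intervals with 4 short bursts: the bursts fall in the top 5% and are ignored.
bursty = [10] * 96 + [1000] * 4
print(p95_charging_volume(bursty))   # 10
print(link_cost(bursty, 2.5))        # 25.0
```

With six burst intervals instead of four, the bursts no longer fit inside the discarded top 5%, and the charging volume jumps to the burst level.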


In principle, the charging volume is the maximum of 
the P95 traffic in either direction. However, since user re- 
quests tend to be much smaller than server replies for on- 
line services, the outgoing direction dominates. Hence, 
we ignore inbound traffic when optimizing the cost of 
OSP traffic. 


2.3 Performance measure of interest 


There are several ways to measure the user-perceived 
performance of an online service. In consultation with 
OSP operators, we use round trip time (RTT) as the per- 
formance measure, which includes the latency between 
the DC and the end host along both directions. The per-
formance of many online services, such as search, email,
maps, and instant messaging, is latency-bound. Small in-
crements in latency can lead to significant losses in rev-
enue [20].


Some online services may also be interested in other 
performance measures such as available bandwidth or 
loss rate along the path. A challenge with using these 
measures for optimizing OSP traffic is scalable estima- 
tion of performance for tens of thousands of paths. Ac- 
curate estimation of available bandwidth or loss rate 
using current techniques requires a large number of 
probes [17, 19, 25]. We leave for the future the task of
extending our work to other performance measures. 


2.4 Cost-performance optimization 


A consequence of the distributed and rich connectivity 
of an OSP network is that an OSP can easily have more 
than a hundred ways to reach a given user in a destination 
prefix. First, an OSP usually replicates an online service 
across multiple DCs in order to improve user experience 
and robustness. An incoming user request can thus be 
directed to any one of these DCs, e.g., using DNS redi- 
rection. Second, the traffic to a given destination prefix 
can be routed to the user via one of many routes, either 
provided by one of the ISPs that directly connect to that 
DC or by one of the ISPs that connect to another DC at 
another location (by first traversing internal links). As-
suming P DCs and a total of Q ISPs, the number of
possible alternative paths for a request-response round
trip is P × Q. (An OSP can select which DC will serve a
destination prefix, but it typically does not control which 
link is used by the incoming traffic.) 


The large number of possible alternative paths and dif-
ferences in their cost and performance create an op-
portunity for optimizing OSP traffic. This optimization
needs to select the target DC and the outgoing route for 
each destination prefix. The (publicly known) state-of- 
the-art in optimizing OSP traffic is mostly manual and 
ad hoc. The default practice is to map a destination pre- 
fix to a geographically close DC and to let BGP control 
the outgoing route from that DC. BGP’s route selection 
is performance-agnostic and can take cost into account in 
a coarse manner at best. On top of that, exceptions may 
be configured manually for prefixes that have very poor 
performance or very high cost. 


The complexity of the problem, however, limits the 
effectiveness of manual methods. Effective optimization 
requires decisions based on the cost-performance trade- 
offs of hundreds of thousands of prefixes. Worse, the 
decisions for various prefixes cannot be made indepen- 
dently because path capacity constraints create complex 
dependencies among prefixes. Automatic methods are 
thus needed to manage this complexity. The develop- 
ment of such methods is the focus of our work. 


3 Problem Formulation 


Consider an OSP as a set of data centers $DC = \{dc_i\}$
and a set of external links $LINK = \{link_j\}$. The DCs
may or may not be interconnected with backbone links.
The OSP needs to deliver traffic to a set of destination
prefixes $D = \{d_k\}$ on the Internet. For each $d_k$, the OSP
has a variety of paths to route the request and reply traf-
fic, as illustrated in Figure 2. A TE strategy is defined
as a collection of assignments of the traffic (request and
reply) for each $d_k$ to a path($dc_i$, $link_j$). Each assign-
ment conceptually consists of two selections, namely DC
selection, e.g., selecting a $dc_i$, and route selection, e.g.,
selecting a $link_j$. The assignments are subject to two
constraints. First, the traffic carried by an external link
should not exceed its capacity. Second, a prefix $d_k$ can
use $link_j$ only if the corresponding ISP (which may be a
peer ISP instead of a provider) provides routes to $d_k$.


Each possible TE strategy has a certain level of ag- 
gregate performance and incurs certain traffic cost to the 
OSP. Our goal is to discover the optimal TE strategies 
that represent the cost-performance trade-offs desired by 
the OSP. For instance, the OSP might want to maximize 
performance for a given cost. Additionally, the relevant 
inputs to this optimization are highly dynamic. Path per- 
formance as well as traffic volume of a prefix, which de- 
termines cost, change with time. We thus want an effi- 
cient, online scheme that adapts the TE strategy as the 
inputs evolve. 


Figure 2: OSP traffic engineering problem. 


4 Entact Key Techniques 


In this section, we provide an overview of the key tech- 
niques in Entact. We present the details of their imple- 
mentations in the next section. There are two primary 
challenges in the design of an online TE scheme in a 
large OSP network. The first challenge is to measure 
in real time the performance and cost of routing traffic to 
a destination prefix via any one of its many alternative 
paths that are not currently being used, without actually 
redirecting the current traffic to those alternative paths. 
Further, to keep up with temporal changes in network 
conditions, this measurement must be conducted at suf- 
ficiently fine granularity. The second challenge is to use 
that cost-performance information in finding a TE strat- 
egy that matches the OSP’s goals. 


4.1 Computing cost and performance 


To quantify the cost and performance of a TE strategy, 
we first measure the performance of individual prefixes 
along various alternative paths. This information is then 
used to compute the aggregate performance and cost 
across all prefixes. 


4.1.1 Measuring performance of individual prefixes 


Our goal is to measure the latency of an alternative path 
for a prefix with minimal impact on the current traffic, 
e.g., without actually changing the path being currently 
used for that prefix. One possible approach is to in- 
fer this latency based on indirect measurements. Pre- 
vious studies have proposed various techniques for pre- 
dicting the latency between two end points on the Inter- 
net [10, 14,22,27]. However, they are designed to predict 
the latency of the current path between two end points in 
the Internet, and hence are not applicable to our task of 
measuring alternative paths. 


We measure the RTT of alternative paths directly us-
ing a novel route injection technique. To measure an al-
ternative path which uses a non-default route R for pre-
fix p, we select an IP address ip within p and install the
route R for ip/32 in the network. This special route is
installed to the routers in the OSP by a BGP daemon that
maintains iBGP peering sessions with them. Because
of the longest-prefix match rule, packets destined to ip
will follow the route R and the rest of the traffic will
follow the default route. Once the alternative route is in-
stalled, we can measure the RTT to p along the route R
using data-plane probes to ip (details in §5.1). Simul-
taneous measurements of multiple alternative paths can
be achieved by choosing a distinct IP address for each
alternative path.
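
The effect of the longest-prefix match rule on an injected /32 route can be illustrated with a toy routing table. This is a minimal sketch, not Entact's implementation: the prefix 203.0.113.0/24, the probe address, and the next-hop labels are made-up values, and a real router performs this lookup in its forwarding plane rather than in Python.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the most specific route that covers `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical table: one default route for the whole destination prefix.
routes = [(ipaddress.ip_network("203.0.113.0/24"), "default-route")]

# Inject a /32 for a single probe address inside the prefix.
probe_ip = "203.0.113.7"
routes.append((ipaddress.ip_network(probe_ip + "/32"), "alternative-route-R"))

print(lookup(routes, probe_ip))        # alternative-route-R
print(lookup(routes, "203.0.113.8"))   # default-route
```

Only probes addressed to the chosen address take the alternative route; every other address in the prefix keeps using the default route, which is why the measurement does not disturb existing traffic.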


4.1.2 Computing performance of a TE strategy 


The measurements of individual prefixes can be used
to compute the aggregate performance of any given TE
strategy. We use the weighted average RTT (wRTT),

$$wRTT = \frac{\sum_p vol_p \times RTT_p}{\sum_p vol_p},$$

of all the traffic as the aggregate performance measure,
where $vol_p$ is the volume of traffic to prefix p, and
$RTT_p$ is the RTT of the path to p in the given TE
strategy. The traffic volume $vol_p$ is estimated based on
the Netflow data collected in the OSP.
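
As a small worked example of the weighted average RTT, the sketch below computes it for two prefixes. The prefix names, volumes, and RTT values are invented for illustration, and the function name is hypothetical.

```python
def weighted_rtt(volumes, rtts):
    """Volume-weighted average RTT over all prefixes.

    volumes[p] is the traffic volume to prefix p; rtts[p] is the RTT of the
    path that the TE strategy assigns to p.
    """
    total = sum(volumes.values())
    if total == 0:
        raise ValueError("total traffic volume must be positive")
    return sum(volumes[p] * rtts[p] for p in volumes) / total


# Made-up inputs: a heavy prefix on a fast path, a light prefix on a slow one.
volumes = {"p1": 300, "p2": 100}
rtts_ms = {"p1": 20, "p2": 60}
print(weighted_rtt(volumes, rtts_ms))  # 30.0
```

The unweighted mean of the two RTTs would be 40 ms; weighting by volume pulls the aggregate toward the path that carries most of the traffic.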


4.1.3 Computing cost of a TE strategy 


A challenge in optimizing traffic cost is that the actual 
traffic cost is calculated based on the 95% link utiliza- 
tion over a long billing period (e.g., a month), while an 
online TE scheme needs to operate at intervals of min- 
utes or hours. While there exist online TE schemes that 
optimize P95 traffic cost [15], the complexity of such 
schemes makes them inapplicable to a large OSP net- 
work with hundreds of neighbor ISPs. We thus choose to 
only consider short-term cost in TE optimization rather 
than directly optimizing P95 cost. Our hypothesis is that,
by consistently employing low-cost strategies in each
short interval, we can lower the actual traffic cost over
the billing period. We present results that validate this
hypothesis in §7.

We use a simple computation to quantify the cost of a
TE strategy in an interval. As discussed in §2.2, we need
to focus only on the external links. For each external link
L, we add the traffic volume to all prefixes that choose
that link in the TE strategy, e.g., $Vol_L = \sum_p vol_p$ where
prefix p uses link L for $vol_p$ amount of traffic. The total
traffic cost of the OSP is $\sum_L F_L(Vol_L)$, where $F_L(\cdot)$ is
the pricing function of the link L. Because this measure
of cost is not the actual traffic cost over the billing period,
we refer to this measure as pseudo cost.
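
The pseudo cost computation can be sketched as follows. The data layout (a dictionary mapping each prefix to its chosen link, and one pricing function per link) and all numbers are assumptions made for this illustration.

```python
from collections import defaultdict


def pseudo_cost(assignment, volumes, pricing):
    """Sum the per-link pricing functions applied to each link's total volume.

    assignment[p] is the external link chosen for prefix p, volumes[p] is the
    traffic volume to p in this interval, and pricing[link] is that link's
    pricing function.
    """
    link_volume = defaultdict(float)
    for prefix, link in assignment.items():
        link_volume[link] += volumes[prefix]
    return sum(pricing[link](vol) for link, vol in link_volume.items())


# Made-up example with two links that use linear pricing (price * volume).
assignment = {"p1": "L1", "p2": "L1", "p3": "L2"}
volumes = {"p1": 100, "p2": 50, "p3": 200}
pricing = {"L1": lambda v: 2.0 * v, "L2": lambda v: 0.5 * v}
print(pseudo_cost(assignment, volumes, pricing))  # 400.0
```

Keeping the pricing function per link, rather than a single price, leaves room for non-linear but non-decreasing cost functions.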


Figure 3: The cost-performance tradeoff in TE strategy space.


4.2 Computing optimal TE strategies 


We now present our optimization framework that uses 
the cost and performance information to derive the desir- 
able TE strategy for an OSP. We first assume the traffic 
to a destination prefix can be arbitrarily divided among 
multiple alternative paths and obtain a class of optimal 
TE strategies. In this class of strategies, one cannot im- 
prove performance without sacrificing cost or vice versa. 
Second, we describe how we select a strategy in this class 
that best matches the cost-performance trade-off that the 
OSP desires. Third, since in practice the traffic to a pre- 
fix cannot be arbitrarily split among multiple alternative 
paths, we devise an efficient heuristic to find an integral 
solution that approximates the desired fractional one. 


4.2.1 Searching for optimal strategy curve 


Given a TE strategy, we can plot its cost and performance
(weighted average RTT or wRTT) on a 2-D plane. This
is illustrated in Figure 3 where each dot represents a strat-
egy. The number of strategies is combinatorial, $N_a^{N_d}$ for
$N_d$ prefixes and $N_a$ alternative paths per prefix. A key
observation is that not all strategies are worth exploring.
In fact, we only need to consider a small subset of opti-
mal strategies that form the lower-left boundary of all the
dots on the plane. A strategy is optimal if no other strat-
egy has both lower wRTT and lower cost. Effectively,
the curve connecting all the optimal strategies forms an
optimal strategy curve on the plane.
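
To make the notion of the lower-left boundary concrete, the sketch below enumerates every strategy of a tiny instance by brute force and keeps only the non-dominated ones. It is an illustration only: the instance is invented, capacity constraints are ignored, linear pricing is assumed, and exhaustive enumeration is feasible only because there are two prefixes with two paths each.

```python
from itertools import product


def enumerate_strategies(paths, volumes):
    """Return (wRTT, cost, choice) for every assignment of one path per prefix.

    paths[p] is a list of (unit_price, rtt_ms) options for prefix p.
    """
    prefixes = sorted(paths)
    total = sum(volumes[p] for p in prefixes)
    points = []
    for choice in product(*(range(len(paths[p])) for p in prefixes)):
        picked = list(zip(prefixes, choice))
        cost = sum(paths[p][c][0] * volumes[p] for p, c in picked)
        wrtt = sum(paths[p][c][1] * volumes[p] for p, c in picked) / total
        points.append((wrtt, cost, choice))
    return points


def optimal_strategies(points):
    """Keep the lower-left boundary: scan by increasing wRTT, keep new cost minima."""
    frontier = []
    best_cost = float("inf")
    for wrtt, cost, choice in sorted(points):
        if cost < best_cost:
            frontier.append((wrtt, cost, choice))
            best_cost = cost
    return frontier


# Made-up instance: per prefix, a list of (unit price, RTT in ms) alternatives.
paths = {"p1": [(1.0, 20.0), (3.0, 10.0)], "p2": [(2.0, 50.0), (2.5, 60.0)]}
volumes = {"p1": 100, "p2": 100}
points = enumerate_strategies(paths, volumes)
frontier = optimal_strategies(points)
for wrtt, cost, choice in frontier:
    print(wrtt, cost, choice)
```

Of the four strategies in this instance, two are dominated and only two remain on the boundary, which is the pruning that makes the optimal strategy curve much smaller than the full strategy space.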


To compute this curve, we sweep from a lower bound
on possible wRTT values to an upper bound on possi-
ble wRTT values at small increments, e.g., 1 ms, and
compute the minimum cost for each wRTT value in this
range. These bounds are set loosely, e.g., the lower
bound can be zero and the upper bound can be ten times
the wRTT of the default strategy.


Given a wRTT R in this range, we compute the min-
imum cost using linear programming (LP). Following the
notations in Figure 2, let $f_{kij}$ be the fraction of traffic to
$d_k$ that traverses path($dc_i$, $link_j$) and $rtt_{kij}$ be the RTT
to $d_k$ via path($dc_i$, $link_j$). The problem of computing
cost can then be described as:

$$\min\ pseudoCost = \sum_j \Big( price_j \times \sum_k \sum_i ( f_{kij} \times vol_k ) \Big)$$

subject to:

$$\sum_k \sum_i ( f_{kij} \times vol_k ) \le \mu \times cap_j \quad (1)$$

$$\sum_k \sum_i \sum_j ( f_{kij} \times vol_k \times rtt_{kij} ) \le \sum_k vol_k \times R \quad (2)$$

$$\sum_i \sum_j f_{kij} = 1 \quad (3)$$


Condition 1 represents the capacity constraint for each
external link and $\mu$ is a constant (by default 0.95) that
reserves some spare capacity to accommodate potential
traffic variations for online TE. Condition 2 represents
the wRTT constraint. Condition 3 ensures all the traf-
fic to a destination is carried. The objective is to find
feasible values for variables $f_{kij}$ that minimize the total
pseudo cost. Solving such an LP for all possible values
of R and connecting the TE strategy points thus obtained
yield the optimal strategy curve.


4.2.2 Selecting a desirable optimal strategy 


Each strategy on the optimal strategy curve represents a
particular tradeoff between performance and cost. Based
on its desired tradeoff, an OSP will typically be
interested in one or more of these strategies. Some of these
strategies are easy to identify, such as minimum cost for
a given performance or minimum wRTT for a given cost
budget. Sometimes, an OSP may desire a more complex
tradeoff between cost and performance. For such an
OSP, we take a parameter $K$ as an input. This parameter
represents the additional unit cost the OSP is willing to
bear for a unit decrease in wRTT.


The desirable strategy for a given $K$ corresponds to
the point in the optimal strategy curve where the slope
of the curve becomes higher than $K$ when going from
right to left. More intuitively, this point is also the
“turning point” or the “sweet spot” when the optimal strategy
curve is plotted after scaling the wRTT by $K$. We can
automatically identify this point along the curve as the one
with the minimum value of $pseudoCost + K \cdot wRTT$.
This point is guaranteed to be unique because the optimal
strategy curve is convex. For convenience, we define
$pseudoCost + K \cdot wRTT$ as the utility of a strategy.
Lower utility values are better. We can directly
find this turning point by slightly modifying the original
optimization problem to minimize utility, instead of
solving the original optimization problem for all possible
wRTT values.
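The selection rule can be sketched on an already computed curve as follows. The points and the function names are made up for illustration; as the text notes, minimizing utility directly in the LP avoids computing the whole curve first.

```python
from typing import List, Tuple

# Each curve point is (wRTT in ms, pseudoCost); this layout is an assumption.
Point = Tuple[float, float]


def utility(point: Point, K: float) -> float:
    """Utility = pseudoCost + K * wRTT; lower is better."""
    wrtt, pseudo_cost = point
    return pseudo_cost + K * wrtt


def desirable_strategy(curve: List[Point], K: float) -> Point:
    """Pick the point on the curve with the minimum utility for this K."""
    return min(curve, key=lambda p: utility(p, K))


curve = [(10.0, 90.0), (12.0, 50.0), (20.0, 20.0)]
balanced = desirable_strategy(curve, K=10)    # utilities: 190, 170, 220
cheapest = desirable_strategy(curve, K=0)     # only cost matters
fastest = desirable_strategy(curve, K=100)    # wRTT dominates
```

A larger $K$ moves the chosen point toward lower wRTT, and a smaller $K$ moves it toward lower cost.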


4.2.3 Finding a practical strategy 


The desirable strategy identified above assumes that
traffic to a prefix can be split arbitrarily across multiple
paths. In practice, however, the traffic to a prefix can
only take one alternative path at a time, and hence
variables $f_{kij}$ must be either 0 or 1. Imposing this
requirement makes the optimization problem an Integer Linear
Programming (ILP) problem, which is NP-hard. We devise
a heuristic to approximate the fractional solution to
an optimal strategy with an integral solution. Intuitively,
our heuristic searches for an integral solution “near” the
desired fractional one.


We start with the fractional solution and sort all the
destination prefixes $d_k$ in ascending order based on
$avail_k = \frac{\sum_{j \in R_k} availCap_j}{vol_k}$, where
$vol_k$ is the traffic volume to $d_k$, $R_k$ is the set of
external links that have routes to reach $d_k$, and
$availCap_j$ is the available capacity at $link_j$. The
$availCap_j$ is initialized to be the capacity of $link_j$
and updated each time a prefix is assigned to use this link.
The $avail_k$ measure gives high priority to prefixes with
large traffic volume and small available capacity. We then
greedily assign the prefixes to paths in the sorted order.


Given a destination $d_k$ and its corresponding $f_{kij}$’s
in the fractional solution, we randomly assign all of its
traffic to one of the paths path($dc_i$, $link_j$) that has
enough residual capacity for $d_k$, with a probability
proportional to $f_{kij}$. Compared to assigning the traffic
to the path with the largest $f_{kij}$, random assignment is
more robust to a bad decision for one particular destination.
Once a prefix is assigned, the available capacity of the
selected link is adjusted accordingly, and the
$avail_k$-based ordering of the remaining unassigned prefixes
is updated as well. In theory, better integral solutions can
be obtained using more sophisticated methods [26]. But as we
show later, our simple heuristic approximates the fractional
solution closely.
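The greedy randomized rounding described above could be sketched as follows. This is an illustrative reading, not the authors' code: the data layout is hypothetical, $R_k$ is approximated by the links that appear among a prefix's candidate paths, and two cases the text leaves open are handled by assumption (a uniform choice when all fitting paths have zero fractional weight, and an error when no path has enough residual capacity).

```python
import random
from typing import Dict, Optional, Tuple

Path = Tuple[str, str]  # (data center, external link); layout is an assumption


def round_strategy(frac: Dict[str, Dict[Path, float]],
                   vol: Dict[str, float],
                   cap: Dict[str, float],
                   rng: Optional[random.Random] = None) -> Dict[str, Path]:
    """Assign every prefix to exactly one path, guided by the fractional f."""
    rng = rng or random.Random()
    avail_cap = dict(cap)          # availCap_j starts at the link capacity
    unassigned = set(frac)
    assignment: Dict[str, Path] = {}

    def avail(k: str) -> float:
        # avail_k = (sum of availCap_j over links that can reach k) / vol_k
        links = {link for (_dc, link) in frac[k]}
        return sum(avail_cap[link] for link in links) / vol[k]

    while unassigned:
        # Re-evaluate the ordering each round, since avail_cap has changed.
        k = min(sorted(unassigned), key=avail)
        fitting = [(p, w) for p, w in frac[k].items()
                   if avail_cap[p[1]] >= vol[k]]
        if not fitting:
            raise ValueError(f"no path has enough residual capacity for {k}")
        paths = [p for p, _w in fitting]
        weights = [w for _p, w in fitting]
        if sum(weights) == 0:
            weights = [1.0] * len(paths)   # assumption: fall back to uniform
        chosen = rng.choices(paths, weights=weights, k=1)[0]
        assignment[k] = chosen
        avail_cap[chosen[1]] -= vol[k]
        unassigned.remove(k)
    return assignment


frac = {"a": {("dc1", "L1"): 1.0},
        "b": {("dc1", "L1"): 0.5, ("dc1", "L2"): 0.5}}
vol = {"a": 8.0, "b": 5.0}
cap = {"L1": 10.0, "L2": 10.0}
result = round_strategy(frac, vol, cap, rng=random.Random(0))
```

In this toy input, prefix `a` has the smaller $avail_k$ and is placed first on `L1`; that leaves too little room on `L1` for `b`, so `b` can only go to `L2` regardless of the random draw.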


5 Prototype Implementation 


In this section, we describe our implementation of Entact.
As shown in Figure 4, there are three inputs to Entact.
The first input is Netflow data from all routers in the
OSP network, which gives us information on flows currently
traversing the network. The second input is routing
tables from all routers, which gives us information
not only on routes currently being used but also on
alternative routes offered by neighbor ISPs. The third
input is the information on link capacities and prices. The
output of Entact is a recommended TE strategy.

Figure 4: The Entact architecture


Entact divides time into fixed-length windows of size
$TE_{win}$ and a new output is produced in every window.
To compute the TE strategy in window $i$, the measurements
of traffic volume and path performance from the
previous window are used. We assume that these quantities
change on a timescale much longer than $TE_{win}$.
We later validate this assumption and also evaluate the
impact of $TE_{win}$. The recommended TE strategy is
applied to the OSP network by injecting the selected routes,
similar to the route injection of /32 IP addresses.


5.1 Measuring path performance 


As mentioned before, to obtain measurements on the per- 
formance of alternative paths to a prefix, we inject spe- 
cial routes to IP addresses in that prefix and then measure 
performance by sending probes to those IP addresses. 
We identify IP addresses within a prefix that respond to 
our probes using the Live IP collector component (Fig- 
ure 4). The Route Injector component injects routes to 
those IP addresses, and the Probers measure the path per- 
formance. We describe each of these components below. 


Live IP collector. Live IP collector is responsible for ef- 
ficiently discovering IP addresses in a prefix that respond 
to our probes. A randomly chosen IP address in a prefix 
is unlikely to be responsive. We use a combination of two 
methods to discover live IP addresses. The first method 
is to probe a subset of IP addresses that are found in Net- 
flow data. The second method is the heuristic proposed 
in [28]. This heuristic prioritizes and orders probes to a 
small subset of IP addresses that are likely to respond,
e.g., *.1 or *.127 addresses, and hence is more efficient 
than random scanning of IP addresses. 


Discovering one responsive IP address in a prefix is 
not enough; we need multiple IP addresses to probe mul- 
tiple paths simultaneously and also to verify if the prefix 
is in a single geographical location (see §6.1). Even the
combination of our two methods does not always find 
enough responsive IP addresses for every Internet prefix. 
In this paper, we restrict ourselves to those prefixes for 
which we can find enough responsive IP addresses. We 
show, however, that our results likely apply to all pre- 
fixes. In the future, we plan to overcome this responsive 
IP limitation by enlisting user machines, e.g., through 
browser toolbars. 


Route injector. Route injector selects alternative routes
from the routing table obtained from routers in the OSP
network, and installs the selected alternative routes on
the routers. The route injector is a BGP daemon that
maintains iBGP sessions with all core and edge routers
in the OSP network. The daemon dynamically sends and
withdraws crafted routes to those routers. We explain the
details of the injection process using a simple example.
We denote a path for a prefix $p$ from data center $DC$
as path($DC$, $egress - nexthop$), where $egress$ is the
OSP’s edge router along the path, and $nexthop$ is the
ISP’s next hop router that is willing to forward traffic
from $egress$ to $p$. In Figure 5, suppose the default BGP
route of $p$ follows path($DC$, $E_1 - N_1$) and we have two
other alternative paths. Given an IP address $IP_2$ within
$p$, to measure an alternative path path($DC$, $E_2 - N_2$)
we do the following:


• Inject $IP_2$/32 with nexthop as $E_2$ into all the core
routers $C_1$, $C_2$, and $C_3$.


• Inject $IP_2$/32 with nexthop as $N_2$ into $E_2$.


Now, traffic to $IP_2$ will traverse the alternative path
that we want to measure, while all traffic to other IP
addresses in $p$, e.g., $IP_1$, will still follow the default
path. Similarly, we can inject another IP address $IP_3$/32
within $p$ and simultaneously measure the performance of the
two alternative paths. With $n$ IP addresses in a prefix, we
can simultaneously measure the performance of $n$ alternative
paths from each DC. The route injection only needs to be
performed once. The injected routes are re-used across
all TE windows, and updated only when there are routing
changes. If more than $n$ paths need to be measured, we
can divide a TE window into smaller slots, and measure
only $n$ paths in each slot. In this case, the route injector
needs to refresh the injected routes for each slot.
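The two injection steps for one measured path can be summarised as a list of host-route announcements. The sketch below only builds that list; the function name, the tuple layout, the documentation address 192.0.2.10, and the router names are hypothetical, and no BGP messages are produced.

```python
import ipaddress
from typing import List, Tuple

# (router receiving the route, injected /32 route, next hop); hypothetical layout.
Injection = Tuple[str, str, str]


def injection_plan(ip: str, egress: str, isp_nexthop: str,
                   core_routers: List[str]) -> List[Injection]:
    """List the /32 routes needed to steer one live IP over one alternative path."""
    host_route = f"{ipaddress.ip_address(ip)}/32"   # also validates the address
    # Step 1: every core router forwards the live IP toward the chosen egress.
    plan = [(core, host_route, egress) for core in core_routers]
    # Step 2: the egress router forwards it to the chosen ISP next hop.
    plan.append((egress, host_route, isp_nexthop))
    return plan


plan = injection_plan("192.0.2.10", "E2", "N2", ["C1", "C2", "C3"])
```

Measuring a second alternative path at the same time would use a different live IP address from the same prefix and a second call with that path's egress and next hop.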


We implement the daemon that achieves the above
functionality by feeding configuration commands to
drive bgpd, an existing BGP daemon [3]. We omit
implementation details due to space limits. It is important,
however, to note that the core and edge routers should be
configured to keep the injected routes only to themselves.
Therefore, route injection does not encounter route
convergence problems, or trigger any route propagation in
or outside the OSP network.

Figure 5: Route injection in a large OSP network.


Probers. Probers are located at all data centers in the 
OSP network and probe the live IPs along the selected 
alternative paths to measure their performance. For each 
path, a prober takes five RTT samples and uses the me- 
dian as the representative estimate of that path. The prob- 
ing module sends a TCP ACK packet to a random high 
port of the destination. This will often trigger the desti- 
nation to return a TCP RST packet. Compared with us- 
ing ICMP probes, the RTT measured by TCP ACK/RST 
is closer to the latency experienced by applications be- 
cause ICMP packets may be forwarded in the network 
with lower priority [16]. 


5.2 Computing TE strategy 


The computation of the TE strategy is based on the path 
performance data, the prefix traffic volume information, 
and the desired operating point of the OSP. The prefix 
traffic volume is computed by the traffic preprocessor 
component in Figure 4. It uses Netflow streams from 
all core routers and computes the traffic volume to each 
prefix by mapping each destination IP address to a prefix. 
For scalability, the Netflow data in our implementation is 
sampled at the rate of 1/1000. 
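A minimal sketch of this preprocessing step is shown below. The flow-record layout, the longest-prefix-match rule, and the scaling of sampled bytes by the inverse sampling rate are assumptions for illustration; the text only states that destination addresses are mapped to prefixes and that Netflow is sampled at 1/1000.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def prefix_volumes(flows: Iterable[Tuple[str, int]],
                   prefixes: List[str],
                   sampling_rate: int = 1000) -> Dict[str, int]:
    """Aggregate sampled (destination IP, bytes) records into per-prefix volume."""
    # Check more specific prefixes first so the longest match wins (assumption).
    nets = sorted((ipaddress.ip_network(p) for p in prefixes),
                  key=lambda n: n.prefixlen, reverse=True)
    volume: Dict[str, int] = defaultdict(int)
    for dst_ip, nbytes in flows:
        addr = ipaddress.ip_address(dst_ip)
        for net in nets:
            if addr in net:
                # Scale up to estimate the unsampled volume (assumption).
                volume[str(net)] += nbytes * sampling_rate
                break
    return dict(volume)


flows = [("10.1.2.3", 100), ("10.2.0.1", 50), ("192.0.2.1", 10)]
volumes = prefix_volumes(flows, ["10.0.0.0/8", "10.1.0.0/16"])
```

The linear scan over prefixes keeps the example short; with tens of thousands of prefixes a trie or similar index would be the more natural choice.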


Finally, the TE optimizer component implements
the optimization process described in §4.2. It uses
MOSEK [6] to solve the LP problems required to generate
the optimal strategy. After identifying the optimal
fractional strategy, the optimizer converts it to an integer
strategy, which becomes the output of the optimization
process.

Figure 6: Location of the 11 DCs used in experiments.


6 Experimental Setup 


We conduct experiments in Microsoft’s global network 
(MSN), one of the largest OSPs today. Figure 6 shows 
the location of the 11 MSN DCs that we use. These 
DCs span North America, Europe, and Asia Pacific and 
are inter-connected with high-speed dedicated and leased 
links that form the backbone of MSN. MSN has roughly 
2K external links, many of which are free peering be- 
cause that helps to lower transit cost for both MSN and its 
neighbors. The number of external links per DC varies
from fewer than ten to several hundred, depending on
the location. We assume that services and corresponding
user data are replicated to all DCs. In reality, some
services may not be present at some of the DCs. The
remainder of this section describes how we select des- 
tination prefixes and how we quantify the performance 
and cost of a TE strategy. 


6.1 Targeted destination prefixes 


To reduce the overhead of TE, we focus on the high- 
volume prefixes that carry the bulk of traffic and whose 
optimization has significant effects on the aggregate cost 
and performance. We start with the top 30K prefixes 
which account for 90% of the total traffic volume. A 
large prefix advertised in global routing sometimes spans 
multiple geographical locations [21]. We could han- 
dle multi-location prefixes by splitting them into smaller 
sub-prefixes. However, as explained below, we would 
need enough live IP addresses in each sub-prefix to deter- 
mine whether a sub-prefix is single-location or not. Due 
to the limited number of live IP addresses we can
discover for each prefix (§5.1), we bypass the multi-location
or low-volume prefixes in this paper.


We consider a prefix to be at a single location if the
difference between the RTTs to any pair of IP addresses
in it is under 5 ms. This is the typical RTT value between
two nodes in the same metropolitan region [21]. A key
parameter in this method is $N_{ip}$, the number of live IP
addresses to which the RTTs are measured. On the one
hand, we need to measure enough live IP addresses in order
not to mis-classify a multi-location prefix as a
single-location one. On the other hand, we can only identify a
limited number of live IP addresses in a prefix.

Figure 7: Maximum RTT difference among $N_{ip}$ IPs

Table 1: Locations of the 6K prefixes in our experiments.


To choose an appropriate $N_{ip}$, we examine the 4.1K
prefixes that have at least 8 live IP addresses. Figure 7
illustrates the distributions of the maximum RTT difference
of each of these prefixes as $N_{ip}$ varies from 2 to
8. While the gap is significant between the distributions
of $N_{ip} = 2$ and $N_{ip} = 4$, it becomes less pronounced
as $N_{ip}$ increases beyond 4. There is only an 8% difference
between the distributions of $N_{ip} = 4$ and $N_{ip} = 8$
when the maximum RTT difference is 5 ms. We thus pick
$N_{ip} = 4$ to balance the accuracy of single-location prefix
identification and the number of prefixes available for use.
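The single-location test can be sketched as a check on the spread of measured RTTs. The function name, the use of the first $N_{ip}$ samples, and returning `None` for prefixes with too few live IP addresses are assumptions made for this example.

```python
from typing import List, Optional


def is_single_location(rtts_ms: List[float], n_ip: int = 4,
                       threshold_ms: float = 5.0) -> Optional[bool]:
    """Classify a prefix from the RTTs (ms) to its live IP addresses.

    Returns None when fewer than n_ip live IPs are available, mirroring the
    decision to discard such prefixes.
    """
    if len(rtts_ms) < n_ip:
        return None
    sample = rtts_ms[:n_ip]   # assumption: use the first n_ip measurements
    # The largest pairwise RTT difference equals max minus min.
    return max(sample) - min(sample) < threshold_ms


same_metro = is_single_location([20.1, 22.0, 21.5, 23.9])
spread_out = is_single_location([20.0, 22.0, 21.0, 26.0])
too_few = is_single_location([20.0, 21.0])
```

Raising `n_ip` makes a multi-location prefix less likely to slip through, at the price of discarding more prefixes for lack of live IP addresses.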


After discarding prefixes with fewer than 4 live IP
addresses, we are left with 15K prefixes. After further
discarding prefixes that are deemed multi-location, we are
left with 6K prefixes which we use in our study. Table 1
characterizes these prefixes by continents and traffic
volumes. While a large portion of the prefixes and traffic
are from North America and Europe, we also have some
coverage in the remaining three continents. The prefixes
are in 2,791 distinct ASes and account for 26% of the
total MSN traffic. The number of alternative routes for a
prefix varies at different DC locations. Among the 66K
DC-prefix pairs, 61% have 1 to 4 routes, 27% have 5 to 8
routes, and the remaining 11% have more than 8 routes.


Our focus on a subset of prefixes raises two questions. 
First, are the results based on these prefixes applicable to 
all prefixes? Second, how should we appropriately scale
link capacities? We consider both questions next. 


6.1.1 Representativeness of selected prefixes 


We argue that the subset of prefixes that we study lets
us estimate well the cost-performance trade-off for all
traffic carried by MSN. For a given set of prefixes, the
benefits of TE optimization hinge on the existence of
alternative paths that are shorter or cheaper than the one
used in the default TE strategy. We find that in this
respect our chosen set of prefixes ($P_s$) is similar to other
prefixes. We randomly select 14K high-volume prefixes
($P_h$) and 4K low-volume prefixes ($P_l$), which account for
29% and 0.8% of the total MSN traffic respectively. For
each prefix $p$ in $P_h$ or $P_l$, we can identify 2 live IP
addresses at the same location (with RTT difference under
5 ms). This means at least some sub-prefix of $p$ will be
at a single location, even though $p$ could span multiple
locations.


For each prefix in $P_s$, $P_h$ and $P_l$, we measure the RTT
of the default route and three other randomly selected
alternative routes from all the 11 DCs every 20 minutes
for 1 day. We compare the default path used by the
default TE strategy, e.g., the path chosen by BGP from the
closest DC, with all other 43 (may be fewer due to the
availability of routes) alternative paths. Figure 8
illustrates the number of alternative paths that are better
than the default path in terms of (a) performance, (b) cost,
or (c) both. We see that the distributions are similar for
the three sets of prefixes, which suggests that each set has
similar cost-performance trade-off characteristics. Thus,
our TE optimization results based on $P_s$ are likely to hold
for other traffic in MSN.


6.1.2 Scaling link capacity 


Each external link has a fixed capacity that limits the traf- 
fic volume that it can carry. We extract link capacities 
from router configuration files in MSN. Because we only 
study a subset of prefixes, we must appropriately scale 
link capacities for our evaluation. 


Let $P_{all}$ and $P_s$ denote the set of all the prefixes and
the set of prefixes that we study. One simple approach
is to scale down the capacity of all links by a constant
$ratio = \frac{vol_s}{vol_{all}}$, where $vol_{all}$ and $vol_s$
are the traffic volumes of the two sets of prefixes in a given
period. The problem with this approach is that it overlooks
the spatial and temporal variations of traffic, since the
ratio actually depends on which link or which period we
consider. This prompts us to compute a ratio for each link
separately. Our observation is that a link is provisioned for
a certain utilization level during peak time. Given $link_j$,
we set $ratio_j = \frac{peak_j^{s}}{peak_j^{all}}$, where
$peak_j^{all}$ and $peak_j^{s}$ are the peak traffic volumes to
$P_{all}$ and to $P_s$ under the default TE strategy during
any 5-minute interval. This ensures the peak utilization of
$link_j$ is the same before and after scaling. Note that
$peak_j^{all}$ and $peak_j^{s}$ may occur in different
5-minute intervals.
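The per-link scaling can be illustrated with a short sketch that multiplies each link's capacity by the share of its peak traffic belonging to the studied prefixes, which keeps the peak utilization unchanged. The input layout (per-link lists of 5-minute volumes under the default strategy) and all names are assumptions for this example.

```python
from typing import Dict, List


def scaled_capacities(cap: Dict[str, float],
                      vol_all: Dict[str, List[float]],
                      vol_studied: Dict[str, List[float]]) -> Dict[str, float]:
    """Scale each link capacity by (peak studied volume) / (peak total volume)."""
    scaled = {}
    for link, capacity in cap.items():
        peak_all = max(vol_all[link])        # peaks may fall in different
        peak_studied = max(vol_studied[link])  # 5-minute intervals
        share = peak_studied / peak_all if peak_all > 0 else 0.0
        scaled[link] = capacity * share
    return scaled


cap = {"L1": 1000.0, "L2": 400.0}
vol_all = {"L1": [100.0, 400.0, 300.0], "L2": [10.0, 20.0]}
vol_studied = {"L1": [50.0, 80.0, 100.0], "L2": [0.0, 0.0]}
scaled = scaled_capacities(cap, vol_all, vol_studied)
```

For `L1`, the peak utilization is 400/1000 before scaling and 100/250 after, both 0.4. `L2` carries no studied traffic under the default strategy, so its scaled capacity is zero.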


Our method for scaling down link capacity is influenced
by the default TE strategy. For instance, if $link_j$
never carries traffic to any prefix in $P_s$ in the default
strategy, its capacity will be scaled down to zero. This
limits the alternative paths that can be explored in TE
optimization, e.g., any alternative strategies that use
$link_j$ will not be considered even though they may help to
lower wRTT and/or cost. Due to this limitation, our results,
which show significant benefits for an OSP, actually represent
a lower bound on the benefits achievable in practice.


6.2 Quantifying performance and cost 


To quantify the cost of a given TE strategy, we record 
the traffic volume to each prefix and compute the traffic 
volume on each external link in each 5-minute interval. 
We then use this information to compute the 95% traffic 
cost (P95) over the entire evaluation period. Thus, even 
though Entact does not directly optimize for P95 cost, 
our evaluation measures the cost that the OSP will bear 
under the P95 scheme. We consider only the P95 scheme 
in our evaluation because it is the dominant charging 
model in MSN. Some ISPs do offer other charging mod- 
els, such as long-term flat rate. Some ISPs also impose 
penalties if traffic volume falls below or exceeds a certain 
threshold. We leave for future work evaluating Entact 
under non-P95 schemes. 
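As a rough sketch of this cost computation, the code below assumes the common 95th-percentile billing convention, in which the highest 5% of the 5-minute samples on a link are discarded and the largest remaining sample is charged at the link's unit price. The exact percentile rule, the per-link data layout, and the function names are assumptions, not details given in the text.

```python
from typing import Dict, List


def p95_volume(samples: List[float]) -> float:
    """Largest sample left after discarding the top 5% of 5-minute samples."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n) - 1, avoiding floating-point rounding.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def p95_cost(link_samples: Dict[str, List[float]],
             price: Dict[str, float]) -> float:
    """Sum over links of unit price times the link's P95 traffic volume."""
    return sum(price[link] * p95_volume(samples)
               for link, samples in link_samples.items())


samples = {"L1": [float(v) for v in range(1, 101)],   # 100 intervals
           "L2": [5.0] * 20}                          # free peering link
price = {"L1": 2.0, "L2": 0.0}
total = p95_cost(samples, price)
```

With 100 samples on `L1`, the five largest are ignored and the link is charged for volume 95, so short bursts within the top 5% of intervals do not raise the bill.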


To quantify the performance, we compute the wRTT 
for each 5-minute interval and take the weighted average 
across the entire evaluation period. A minor complica- 
tion is that we do not have fine time-scale RTT measure- 
ments for a prefix. To control overhead of active probing 
and route injection, we obtain two measurements (where 
each measurement is based on sending 5 RTT probes) in 
a 20-minute interval. 
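The performance metric can be sketched as a volume-weighted average. The first function follows the weighting used in the wRTT constraint of the optimization; weighting each 5-minute interval by its total traffic volume in the second function is an assumption, since the text only says that a weighted average is taken across the evaluation period. All names are illustrative.

```python
from typing import Dict, List, Tuple


def interval_wrtt(vol: Dict[str, float], rtt: Dict[str, float]) -> float:
    """Volume-weighted average RTT over the prefixes in one 5-minute interval."""
    total = sum(vol.values())
    return sum(vol[k] * rtt[k] for k in vol) / total


def period_wrtt(intervals: List[Tuple[Dict[str, float], Dict[str, float]]]) -> float:
    """Average the per-interval wRTT, weighting each interval by its volume."""
    total = sum(sum(vol.values()) for vol, _rtt in intervals)
    return sum(interval_wrtt(vol, rtt) * sum(vol.values())
               for vol, rtt in intervals) / total


first = ({"a": 10.0, "b": 30.0}, {"a": 20.0, "b": 40.0})
second = ({"a": 40.0}, {"a": 10.0})
one_interval = interval_wrtt(*first)          # (10*20 + 30*40) / 40
whole_period = period_wrtt([first, second])   # (35*40 + 10*40) / 80
```

Here `rtt[k]` is the RTT of the path assigned to prefix `k` in that interval, so a strategy that moves a high-volume prefix to a longer path raises the wRTT more than one that moves a low-volume prefix.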


We find, however, that these coarse time-scale
measurements are a good proxy for predicting finer
time-scale performance. To illustrate this, we randomly
select 500 prefixes and 2 alternate routes for each selected
prefix. From each DC, we measure each of these 1,000
paths once a minute during a 20-minute interval. We
then divide the interval into four 5-minute intervals. For
each path and 5-minute interval, we compute $rtt_5$ by
averaging the 5 measurements in that interval. For the
same path, we also compute $rtt_{20}$ by averaging two
randomly selected measurements in the 20-minute interval.
We conduct this experiment for 1 day and calculate the
difference between $rtt_5$ and $rtt_{20}$ of all paths. It
turns out that $rtt_{20}$ is indeed very close to $rtt_5$. The
difference is under 1 ms and 5 ms in 78% and 92% of the cases
respectively.

Figure 8: Number of alternative paths that are better than the default path in the set of 6K single-location prefixes, the
set of 14K other high-volume prefixes, and the set of 4K randomly selected low-volume prefixes.


7 Results 


In this section, we demonstrate and explain the benefits 
of online TE optimization in MSN. We also study how 
the TE optimization results are affected by a few key pa- 
rameters in Entact, including the number of DCs, num- 
ber of alternative routes, and TE optimization window. 
Our results are based on one week of data collected in
September 2009, which allows us to capture the time-of-day
and day-of-week patterns. Since the traffic and
performance characteristics in MSN are usually quite stable
over several weeks, we expect our results to be applicable
to longer durations as well.


Currently, the operators of MSN only allow us to in- 
ject /32 prefixes into the network in order to restrict the 
impact of Entact on customer traffic. As a result, we 
have limited capability in implementing a non-default 
TE strategy since we cannot arbitrarily change the DC 
selection or route selection for any prefix. Instead, we 
can only simulate a non-default TE strategy based on the 
routing, performance and traffic data collected under the 
default TE strategy in MSN. When presenting the follow- 
ing TE optimization results, we assume that the routing, 
performance and traffic to each prefix do not change
under different TE strategies. This is a common assumption
made by most of the existing work on TE [9, 12, 15]. We 
hope to study the effectiveness of Entact without such 
restrictions in the future. 


7.1 Benefits of TE optimization 


Figure 9 compares the wRTT and cost of four TE strategies:
the default, $Entact_{10}$ ($K = 10$), LowestCost
(minimizing cost with $K = 0$), and BestPerf (minimizing
wRTT with $K = \infty$). We use a 20-minute TE
window and 4 alternative routes from each DC for TE
optimization. The x-axis is the wRTT in milliseconds and
the y-axis is the relative cost. We cannot reveal the actual
dollar cost for confidentiality reasons. There is a big gap
between the default strategy and $Entact_{10}$, which
indicates the former is far from optimal. In fact,
$Entact_{10}$ can reduce the default cost by 40% without
inflating wRTT. This could lead to enormous savings for MSN,
since it spends tens of millions of dollars a year on transit
traffic cost.

Figure 9: Comparison of various TE strategies.


We also notice there is a significant tradeoff between
cost and performance among the optimal strategies. At
one extreme, the LowestCost strategy can eliminate almost
all the transit cost by diverting traffic to free peering
links. But this comes at the expense of inflating the
default wRTT by 38 ms. Such a large RTT increase will
notably degrade user-perceived performance when amplified
by the many round trips involved in downloading
content-rich Web pages. At the other extreme, the
BestPerf strategy can reduce the default wRTT by 3 ms
while increasing the default cost by 66%. This is not
an appropriate strategy either, given the relatively large
cost increase and small performance gain. $Entact_{10}$
appears to be at a “sweet spot” between the two extremes.
By exposing the performance and cost of various optimal
strategies, the operators can make a more informed
decision regarding which is a desirable operating point.


Table 2: Comparison of paths under the default and
$Entact_{10}$ strategies in terms of performance and cost.

path type        | wRTT (ms), default → $Entact_{10}$ | pseudo cost, default → $Entact_{10}$
pricier, shorter |                                    | 13.7 → 55.8
cheaper, longer  | 27.6 → 39.8                        | 738.3 → 177.8
cheaper, shorter | 55.5 → 47.8                        | 483.7 → 174.4


Table 3: Comparison of paths under the default and
$Entact_{10}$ strategies in terms of DC selection and route
selection.

path type
non-default DC, default route
non-default DC, non-default route
default DC, non-default route


To better understand the source of the improvement
offered by $Entact_{10}$, we compare $Entact_{10}$ with the
default strategy during a typical 20-minute TE window. Table 2
breaks down the prefixes based on their relative pseudo
cost and performance under these two strategies. Overall,
the majority (88.2%) of the prefixes are assigned to
the default path in $Entact_{10}$. Among the remaining
prefixes, very few (0.1%) use a non-default path that is both
longer and pricier than the default path (as expected).
Only a small number of prefixes (1.7%) use
a non-default path that is both cheaper and shorter. In
contrast, 10.1% of the prefixes use a non-default path
that is better in one metric but worse in the other. This
means $Entact_{10}$ is actually making some “intelligent”
performance-cost tradeoff for different prefixes instead
of simply assigning each prefix to a “better” non-default
path. For instance, 4.6% of the prefixes use a shorter but
pricier non-default path. While this slightly increases the
pseudo cost by 42.1, it helps to reduce the wRTT of these
prefixes by 14.5 ms. More importantly, it frees up the
capacity on some cheap peering links which can be used
to carry traffic for certain prefixes that incur high pseudo
cost under the default strategy. 5.5% of the prefixes use a
cheaper but longer non-default path. This helps to
drastically cut the pseudo cost by 560.5 at the expense of a
moderate increase of wRTT (12.2 ms) for these prefixes.
Note that $Entact_{10}$ may not find a free path for every
prefix due to the performance and capacity constraints.
The complexity of the TE strategy within each TE window
and the dynamics of TE optimization across time
underscore the importance of employing an automated
TE scheme like Entact in a large OSP.

Figure 10: Effect of DC selection on TE optimization.
(Utility and cost are scaled according to wRTT of the
default strategy.)


Table 3 breaks down the prefixes that use a non-default
path under $Entact_{10}$ during the 20-minute TE window by
whether a non-default DC or a non-default route from
a DC is used. Both non-default DCs and non-default
routes are used under $Entact_{10}$: 4.6% of the prefixes
use a non-default DC and 9.7% of them use a non-default
route from a DC. Non-default routes appear to be more
important than non-default DCs in TE optimization. We
will further study the effect of DC selection and route
selection in §7.2 and §7.3.


Figure 9 shows that the difference between the integral
and fractional solutions of $Entact_{10}$ is negligibly small.
In TE optimization, the traffic to a prefix will be split
across multiple alternative paths only when some
alternative paths do not have enough capacity to accommodate
all the traffic to that prefix. This seldom happens
because the traffic volume to a prefix is relatively small
compared to the capacity of a peering link in MSN.


We also compare the online $Entact_{10}$ with the offline
one. In the latter case, we directly use the routing,
performance, and traffic volume information of a 20-minute TE
window to optimize TE in the same window. This
represents the ideal case where there is no prediction error.
Figure 9 shows the online $Entact_{10}$ incurs only a little
extra wRTT and cost compared to the offline one (the
two strategy points almost completely overlap). This is
because the RTT and traffic to most of the prefixes are
quite stable during such a short period (e.g., 20 minutes).
We will study to what extent the TE window affects the
optimization results in §7.4.


7.2 Effects of DC selection 


We now study the effects of DC selection on TE
optimization. A larger number of DCs will provide more
alternative paths for TE optimization, which in turn should
lead to better improvement over the default strategy.
Nonetheless, this will also incur greater overhead in RTT
measurement and TE optimization. We want to understand
how many DCs are required to attain most of the
TE optimization benefits. For each prefix, we sort the
11 DCs based on the RTT of the default route from each
DC. We only use the RTT measurements taken in the first
TE window of the evaluation period to sort the DCs. The
ordering of the DCs should be quite stable and can be
updated at a coarse granularity, e.g., once a week. We
develop a slightly modified $Entact_{10}^{n}$ which only
considers the alternative paths from the closest $n$ DCs to
each prefix for TE optimization.


Figure 10 compares the wRTT, cost, and utility
(§4.2.2) of $Entact_{10}^{n}$ as $n$ varies from 1 to 11. We
use 4 alternative routes from each DC to each prefix. Given
a TE window, as $n$ changes, the optimal strategy curve and
the optimal strategy selected by $Entact_{10}^{n}$ will change
accordingly. This complicates the comparison between two
different $Entact_{10}^{n}$’s since one of them may have
higher cost but smaller wRTT. For this reason, we focus on
comparing the utility for different values of $n$. As shown
in the figure, $Entact_{10}^{1}$ (only with route selection but
no DC selection) and $Entact_{10}^{2}$ can cut the utility by
12% and 18% respectively compared to the default strategy.
The utility reduction diminishes as $n$ exceeds 2. This
suggests that TE optimization benefits can be attributed to
both route selection and DC selection. Moreover, selecting
the closest two DCs for each prefix seems to attain almost
all the TE optimization benefits. Further investigation
reveals that most prefixes have at most two nearby DCs. Using
more DCs generally will not help TE optimization
because the RTT from those DCs is too large.


Note that the utility of $Entact_{10}^{n}$ can be slightly
higher for a larger $n$ than for a smaller one. This is
because the utility of $Entact_{10}^{n}$ is computed from the
95% traffic cost during the entire evaluation period.
However, $Entact_{10}^{n}$ only minimizes pseudo utility
computed from pseudo cost in each TE window. Even though the
pseudo utility obtained by $Entact_{10}^{n}$ in a TE window
always decreases as $n$ grows, the utility over the entire
evaluation period may actually move in the opposite direction.


7.3 Effects of alternative routes 


We evaluate how TE optimization is affected by the
number of alternative routes ($m$) from each DC. A larger $m$
will not only offer more flexibility in TE optimization
but also incur greater overhead in terms of route
injection, optimization, and RTT measurement. In this
experiment, we measure the RTT of 8 alternative routes from
each DC to each prefix every 20 minutes for 1 day.
Figure 11 illustrates the wRTT, cost, and utility of
$Entact_{10}$ under different $m$. For the same reason as in
the previous section, we focus on comparing utility. As $m$
grows from 1 to 3, the utility gradually decreases up to 14%
compared to the default strategy. The utility almost
remains the same after $m$ exceeds 3. This suggests that 2
to 3 alternative routes are sufficient for TE optimization
in MSN.

Figure 11: Effect of the number of alternative routes on
TE optimization.

Figure 12: Effect of the TE window on TE optimization.


7.4 Effects of TE window 


Finally, we study the impact of TE window on optimiza- 
tion results. Entact performs online TE in a TE window 
using predicted performance and traffic information (§5).
On the one hand, both performance and traffic volume 
can vary significantly within a large TE window. It will 
be extremely difficult to find a fixed TE strategy that per- 
forms well during the entire TE window. On the other 
hand, a small TE window will incur high overhead in 
route injection, RTT measurement, and TE optimization. 
It may even lead to frequent user-perceived performance 
variations. 


Figure 12 illustrates the wRTT, cost, and utility of
Entact_10 under different TE window sizes from 20 minutes
to 4 hours. As before, we focus on comparing the
utility. We still use 4 alternative routes from each DC
to each prefix. Entact_10 can attain about the same utility
reduction compared to the default strategy when the TE
window is under 1 hour. This is because the performance
and traffic volume are relatively stable at such a time
scale. As the TE window exceeds 1 hour, the utility
noticeably increases. With a 4-hour TE window, Entact_10
can only reduce the default utility by 1%. In fact,
because the traffic volume can fluctuate over a wide range
during 4 hours, Entact_10 effectively optimizes TE for the
peak interval to avoid link congestion. This leads to a
sub-optimal TE strategy for many non-peak intervals. In
§8, we show that a 1-hour TE window imposes reasonably
low overhead.


Table 4: Route injection overhead measured on a testbed
(columns: injection time, CPU, RIB, FIB).


8 Online TE Optimization Overhead 


So far, we have demonstrated the benefits provided by
Entact. In this section, we study the feasibility of deploying
Entact to perform full-scale online TE optimization
in a large OSP. The key factor that determines the overheads
of Entact is the number of prefixes. While there
are roughly 300K Internet prefixes in total, we will focus
on the top 30K high-volume prefixes that account for
90% of the traffic in MSN (§6.1). Multi-location prefixes
may inflate the actual number of prefixes beyond
30K; we leave the study of multi-location prefixes as future
work. The results in §7.3 and §7.4 suggest that Entact
can attain most of the benefits by using 2 alternative
routes from each DC and a 1-hour TE window. We now
evaluate the performance and scalability of key Entact
components under these settings.


8.1 Route injection 


We evaluate the route injection overhead by setting up a 
router testbed in the Schooner lab [4]. The testbed com- 
prises a Cisco 12000 router and a PC running our route 
injector. Cisco 12000 routers are commonly used in the 
backbone network of large OSPs. When Entact initial- 
izes, it needs to inject 30K routes into each router in or- 
der to measure the RTT of the default route and one non- 
default route simultaneously. This injection process can 
be spread over several days to avoid overloading routers. 
Table 4 shows the size of the RIB (routing information
base) and FIB (forwarding information base) as the number
of injected routes grows. 30K routes occupy only
about 4.8 MB in the RIB and FIB. Such memory overhead
is relatively small given that today’s routers typically
hold roughly 300K routes (the number of all Internet
prefixes).


After the initial injection is done, Entact needs to
continually inject routes to apply the output of the online TE
optimization. Table 4 also shows the injection time for
different numbers of routes. It takes only 51 seconds to
inject 30K routes, which is negligible compared to
the 1-hour TE window. We expect the actual number of
injected routes in a TE window to be much smaller because
most prefixes will simply use a default route (§7.1).


8.2 Impact on downstream ISPs 


Compared to the default TE strategy, the online TE
optimization performed by Entact may cause traffic to shift
more frequently. This is because Entact needs to continually
adapt to changes in performance and traffic volume
in an OSP. A large traffic shift may even overload certain
links in downstream ISPs, raising challenges in the
TE of these downstream ISPs. This problem may be
exacerbated if multiple large OSPs perform such online TE
optimization simultaneously. Given a 5-minute interval
i, we define a total traffic shift to quantify the impact of
an online TE strategy on downstream ISPs:


$$\mathrm{TotalShift}_i = \frac{\sum_{p} \mathrm{shift}_i(p)}{\sum_{p} \mathrm{vol}_i(p)}$$


Here, vol_i(p) is the traffic volume to prefix p and
shift_i(p) is the traffic shift to p in interval i. If p stays
on the same path in intervals i and i - 1, shift_i(p) is
computed as the increase of vol_i(p) over vol_{i-1}(p).
Otherwise, shift_i(p) = vol_i(p). In essence, shift_i(p)
captures the additional traffic load imposed on downstream
ISPs relative to the previous interval. The additional
traffic load is either due to path change or due to natural
traffic demand growth.
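

The definition of TotalShift can be sketched in a few lines of
Python. This is only an illustrative sketch, not Entact's code:
the function name total_shift, the dictionary-based inputs, and
the choice to count a drop in volume as zero shift (so that only
increases contribute) are assumptions made for this example.

```python
def total_shift(vol_prev, vol_cur, path_prev, path_cur):
    """TotalShift_i = sum_p shift_i(p) / sum_p vol_i(p) for one interval i.

    vol_prev, vol_cur:   dict prefix -> traffic volume in intervals i-1 and i
    path_prev, path_cur: dict prefix -> path identifier in intervals i-1 and i
    """
    total_volume = sum(vol_cur.values())
    if total_volume == 0:
        return 0.0

    shifted = 0.0
    for prefix, volume in vol_cur.items():
        if path_cur.get(prefix) == path_prev.get(prefix):
            # Same path: only the growth in demand counts as shift.
            # Assumption: a decrease in volume contributes zero.
            shifted += max(0.0, volume - vol_prev.get(prefix, 0.0))
        else:
            # Path changed: the whole volume is new load on the new path.
            shifted += volume
    return shifted / total_volume


# Hypothetical data: "a" grows by 20, "b" shrinks, "c" moves to a new path.
vol_prev = {"a": 100, "b": 50, "c": 30}
vol_cur = {"a": 120, "b": 40, "c": 40}
path_prev = {"a": "L1", "b": "L2", "c": "L3"}
path_cur = {"a": "L1", "b": "L2", "c": "L4"}
print(total_shift(vol_prev, vol_cur, path_prev, path_cur))  # (20 + 0 + 40) / 200
```

In this hypothetical interval, 30% of the traffic counts as shifted,
two thirds of it because of a single path change.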


Figure 13 compares the TotalShift under the static
TE strategy, the default TE strategy, and Entact_10 over
the entire evaluation period. In the static strategy, the
TE remains the same across different intervals, and its
traffic shift is entirely caused by natural traffic demand
variations. We observe that most of the traffic shift is
actually caused by natural traffic demand variations. The
traffic shift of Entact_10 is only slightly larger than that
of the default strategy. As explained in §7.1, Entact_10
assigns a majority of the prefixes to a default path and
only reshuffles the traffic to roughly 10% of the prefixes.
Moreover, the paths of those 10% of prefixes do not always
change across different intervals. As a result, Entact_10
incurs limited extra traffic shift compared to the default
strategy.


Figure 13: Traffic shift under the static, default, and
Entact_10 TE strategies.


8.3 Computation time 


Entact computes an optimal strategy in two steps: i)
solving an LP problem to find a fractional solution; ii)
converting the fractional solution into an integer one. Let
n be the number of prefixes, d be the number of DCs,
and l be the number of peering links. The number of
variables f_ijk in the LP problem is n × d × l. Since d and
l are usually much smaller than n and do not grow with n,
we consider the size of the LP problem to be O(n). The
worst case complexity of an LP problem is O(n^3.5) [1].
The heuristic for converting the fractional solution into
an integer one (§4.2.3) requires n iterations to assign n
prefixes. In each iteration, it takes O(n log(n)) to sort
the unassigned prefixes in the worst case. Therefore, the
complexity of this step is O(n^2 log(n)).

We evaluate the time to solve the LP problem since 
it is the computation bottleneck in TE optimization. We 
use Mosek [6] as the LP solver and measure the opti- 
mization time of one TE window on a Windows Server 
2008 machine with two 2.5 GHz Xeon processors and 
16 GB memory. We run two experiments using the top 
20K high-volume prefixes and all the 300K prefixes re- 
spectively. The RTTs of the 20K prefixes are from real 
measurement while the RTTs of the 300K prefixes are 
generated based on the RTT distribution of the 20K pre- 
fixes. We consider 2 alternative routes from each of the 
11 DCs to each prefix. The traffic volume, routing, and 
link price and capacity information are directly drawn 
from the MSN dataset. The running times of the two
experiments are 9 and 171 seconds respectively, representing
a small fraction of a 1-hour TE window.


8.4 Probing requirement


To probe 30K prefixes in a 1-hour TE window, the bandwidth
usage of each prober will be 30K (prefixes) x 2
(alternative routes) x 2 (RTT measurements) x 5 (TCP
packets) x 80 (bytes) x 8 (bits per byte) / 3600 (seconds)
≈ 0.1 Mbps. Such overhead is negligibly small.
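

The probing estimate can be checked with a short calculation. The
sketch below only restates the arithmetic from the text; the
function name and parameter names are chosen here for illustration,
and the factor of 8 converts bytes to bits.

```python
def probing_bandwidth_mbps(prefixes, routes, measurements,
                           packets, packet_bytes, window_seconds):
    """Average probing rate of one prober, in megabits per second."""
    total_bits = prefixes * routes * measurements * packets * packet_bytes * 8
    return total_bits / window_seconds / 1e6


# Values quoted in the text: 30K prefixes, 2 alternative routes,
# 2 RTT measurements, 5 TCP packets of 80 bytes, 1-hour TE window.
rate = probing_bandwidth_mbps(30_000, 2, 2, 5, 80, 3600)
print(f"{rate:.3f} Mbps")  # 0.107 Mbps
```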


8.5 Processing traffic data 


We use a Windows Server 2008 machine with two 2.5 
GHz Xeon processors and 16 GB memory to collect and 
process the Netflow data from all the routers in MSN. 
It takes about 80 seconds to process the traffic data of 
one 5-minute interval during peak time. Because Netflow 
data is processed on-the-fly as the data is streamed to 
Entact, such processing speed is fast enough for online 
TE optimization. 


9 Related Work 


Our work is closely related to the recent work on explor- 
ing route diversity in multihoming, which broadly falls 
into two categories. The first category includes mea- 
surement studies that aim to quantify the potential per- 
formance benefits of exploring route diversity, including 
the comparative study of overlay routing vs. multihom- 
ing [7,8,11,24]. These studies typically ignore the cost of 
the multihoming connectivity. In [7], Akella et al. quantify
the potential performance benefits of multihoming
using traces collected from a large CDN network. Their
results show that smart route selection has the potential 
to achieve an average performance improvement of 25% 
or more for a 2-multihomed customer in most cases, and 
most of the benefits of multihoming can be achieved us- 
ing 4 providers. Our work differs from these studies in 
that it considers both performance and cost. 


The second category of work on multihoming includes 
algorithmic studies of route selection to optimize cost, 
or performance under certain cost constraint [12, 15]. 
For example, Goldenberg et al. [15] design a number 
of algorithms that assign individual flows to multiple 
providers to optimize the total cost or the total latency 
for all the flows under fixed cost constraint. Dhamdhere 
and Dovrolis [12] develop algorithms for selecting ISPs 
for multihoming to minimize cost and maximize avail- 
ability, and for egress route selection that minimizes the 
total cost under the constraint of no congestion. Our 
work differs from these algorithmic studies in a few major
ways. First, we propose a novel joint TE optimization
technique that searches for the optimal “sweet-spot” in
the performance-cost continuum. Second, we present the
design and implementation details of a route-injection-based
technique that measures the performance of alternative
paths in real time. Finally, to our knowledge, we
provide the first TE study on a large OSP network which
exhibits significantly different characteristics from
multihoming stub networks previously studied.


Our work, as well as previous work on route selection
in multihoming, differs from the numerous studies on intra- and
inter-domain traffic engineering, e.g., [13, 18, 23]. The
focus of these latter studies is on balancing the utilization
of ISP links instead of on optimizing end-to-end user
performance.


10 Conclusion 


We studied the problem of optimizing cost and perfor- 
mance of carrying traffic for an OSP network. This prob- 
lem is unique in that an OSP has the flexibility to source 
traffic from different data centers around the globe and 
has hundreds of connections to ISPs, many of which 
carry traffic to only parts of the Internet. We formulated 
the TE optimization problem in OSP networks, and pre- 
sented the design of the Entact online TE scheme. Us- 
ing our prototype implementation, we conducted a trace- 
driven evaluation of Entact for a large OSP with 11 data 
centers. We found that Entact can help this OSP reduce
the traffic cost by 40% without compromising performance.
We also found these benefits can be realized
with acceptably low overheads. 
Abstract


Existing flooding algorithms have demonstrated their
effectiveness in achieving communication efficiency and
reliability in wireless sensor networks. However, further
performance improvement has been hampered by
the assumption of link independence, a design premise
imposing the need for costly acknowledgements (ACKs)
from every receiver. In this paper, we present Collective
Flooding (CF), which exploits the link correlation to
achieve flooding reliability using the concept of collective
ACKs. CF requires only 1-hop information from a
sender, making the design highly distributed and scalable
with low complexity. We evaluate CF extensively in
real-world settings, using three different types of testbeds:
a single hop network with 20 MICAz nodes, a multi-hop
network with 37 nodes, and a linear outdoor network
with 48 nodes along a 326-meter-long bridge. System
evaluation and extensive simulation show that CF
achieves the same reliability as the state-of-the-art
solutions, while reducing the total number of packet
transmissions and the dissemination delay by 30% to 50%
and 35% to 50%, respectively.


1 Introduction 

In wireless sensor networks, flooding is a protocol that 
delivers a message from one node to all the other nodes. 
Flooding is a fundamental operation for time synchro- 
nization [15], data dissemination [25, 26, 17, 10], group 
formation [14], node localization [39], and routing tree 
formation [6, 29]. 

Existing flooding algorithms [18, 34, 24, 12] have
demonstrated their effectiveness in achieving communication
efficiency and reliability in wireless sensor networks.
Further performance improvement, however, has
been hampered by the implicit assumption of link
independence adopted in previous designs. In other words,
existing flooding algorithms assume that the receptions
of a flooding message by multiple neighboring nodes are
probabilistically independent of each other. Under such
an assumption, it is necessary to have an acknowledgement
(ACK) directly from the intended receiver for reliable
flooding. This is because a node’s ACK cannot be
used to estimate the reception at other neighboring nodes
if link independence is assumed.

However, direct ACKs per receiver may lead to high
collision [11, 8], congestion [2], and possibly the ACK
storm problem [24] in wireless networks. To address this
inefficiency in ACKs, this work presents the first comprehensive
study to exploit link correlation in the context of
flooding design in wireless sensor networks. The driving
idea behind our design is collective ACKs. Previously, a
sender estimated whether a transmission was successful
based only on the feedback from the intended receiver.
Instead, the mechanism of collective ACKs allows the
sender to infer the success of a transmission to a receiver
based on the ACKs from other neighboring receivers
by utilizing the link correlation among them. Specifically,
we use the Conditional Packet Reception Probability
(CPRP) as a metric to characterize the correlation
among links. The CPRP is the probability of a node’s
successfully receiving a packet, given the condition that
its neighbor has received the same packet. Based on the
environment’s stability, this metric is measured and
calculated online among neighboring nodes using a form
of hello message at an adaptive time interval (i.e., a small
interval when the environment is dynamic and a large
interval when the environment is stable).


With link correlation information (CPRP) available
among neighboring nodes, collective ACKs are achieved
in an accumulative manner. The success of a transmission
to a node (defined as the coverage probability of a
node) is no longer a binary (0/1) estimation, but a
probability value between 0 and 1. Using collective ACKs, a
sender updates the coverage probability values of neighboring
receivers whenever (i) it transmits or (ii) overhears
a rebroadcast message. To improve efficiency, a
transmission is considered necessary only when the coverage
probability of a neighboring node has not reached
a certain user-desired reliability threshold.


In addition to collective ACKs, we propose a dynamic
forwarding technique to exploit link correlation further.
In Collective Flooding, only a small set of nodes is selected
dynamically as the forwarders of a flooding message
via self-organized competition among neighboring
nodes. Every node estimates its transmission effectiveness
based on three factors: (i) neighborhood size, (ii)
link quality, and (iii) link correlation among neighboring
nodes. The most effective node will start to rebroadcast
early to suppress the less effective nodes’ rebroadcasts,
consequently reducing the message redundancy.


In summary, our contributions are as follows:

• To our knowledge, collective ACK is a new concept
that can improve the efficiency of reliable flooding
operations. It transforms the traditional direct ACKs per
receiver into correlated and accumulative ACKs.

• Although the phenomenon of link correlation has been
mentioned in the literature [31], we provide the first
extensive study to exploit this phenomenon for communication
improvement. We reveal that link correlation can be
used to achieve (i) collective ACKs, as well as (ii) efficient
forwarder selection.

• Our design is simple and symmetric. Rebroadcast decisions
at individual nodes are based on the coverage probability
values of neighbors, which in turn are updated
by overhearing rebroadcasts from their neighbors. All
the operations only need 1-hop neighbors’ information,
making our protocol highly distributed and scalable.

• We evaluate our work extensively in multiple real-world
testbeds and large scale simulation. The results
indicate that our design is practical, reliable, and outperforms
several existing state-of-the-art designs.


Figure 2. Distribution of Conditional Packet Reception Probability

The rest of the paper is organized as follows: Section 2
presents the motivation behind the work. Section 3
introduces two key mechanisms. Section 4 describes the
design. Sections 5 and 6 evaluate the work with testbeds
and simulation. After discussing related work in Section 7,
Section 8 concludes the paper.


2 Motivation 


Previous studies on wireless links focus on packet
receptions of individual receivers with single [31, 33, 4,
43, 22] or multiple [21] radios. Little systematic study
has investigated the packet reception correlation among
neighboring receivers. To fill the gap, this section reports
our empirical study on wireless link correlation.
More specifically, we observed the following phenomenon,
which serves as the foundation of this work.
Observation: For packets transmitted from the same
sender, if a packet is received by a node with a low
packet reception ratio (PRR), it is highly probable that
this packet is also received by the nodes with a high PRR.


2.1 Experiment Setup


In our experiments, 42 MICAz nodes were used.
The experiments were conducted with multiple randomly
generated layouts under two different scenarios: (i) an
open parking lot, and (ii) an indoor office. In each scenario,
two types of experiments were conducted: Fixed
Single Sender and Round Robin Sender. In the Fixed
Single Sender experiment, the sender was placed in the
center of the topology, while the other 41 nodes were
randomly deployed as receivers. The sender broadcast
a packet every 200 ms. Each packet was identified by
a sequence number. The total number of packets broadcast
was 6000. In the Round Robin Sender experiment,
each node in turn broadcast 200 packets at
time intervals of 200 ms. The receivers kept track of the
received packets through the sender’s ID and packet sequence
number.


2.2 Correlated Packet Reception 


In both indoor and outdoor experiments, we discovered
that if a packet is received by a sensor node with
low PRR, most of the time this packet is also received by
the high PRR nodes. Figures 1(a) and 1(b) illustrate the
first 600 packet receptions of two groups of three nodes
in indoor and outdoor experiments, respectively. The locations
of the nodes are shown in Figure 3 and Figure 4.
The black bands correspond to the packets received at the
nodes. Clearly, there exists a strong correlation of packet
receptions among the neighboring nodes. For example,
in Figure 1(a), given the two packets (sequence number
282 and 508) received by N22, these two packets were
also received by N29 and N23. In order to quantify this
correlation, we define the Conditional Packet Reception
Probability (CPRP), as follows:


Definition: The Conditional Packet Reception Probability
is the probability that a high PRR node N_h receives
a packet M from sender node S, given the condition that
the packet M is received by a low PRR node N_l.


We use P_S(N_h|N_l) to denote CPRP, where N_h and N_l
are neighboring receivers of the sender S. For example,
in Figure 1(b), node N31 received 38 packets and
37 out of these packets were also received by node N27,
so P_S(N27|N31) = 97.4%. If the assumption on link
independence holds, we would expect P_S(N27|N31) =
P_S(N27). However, this is not the case; as shown by the
experiment, P_S(N27) is 64.9% instead of 97.4%. This indicates
a packet reception correlation between N31 and
N27, which is also valid for node pairs N31 & N24 and
N27 & N24.


Figure 3. Packet Reception Ratios (PRR) of Individual Nodes in an Indoor Experiment


To analyze the CPRP among the pairwise receivers
more systematically, we computed the CPRP for all node
pairs with non-zero PRR values. In the indoor experiment,
we had 32 non-zero PRR nodes, which generates
(32 × 31)/2 = 496 combinations of P_S(N_h|N_l). Figure 2(a)
shows the distribution of these combinations. Figure 2(b)
illustrates the distribution for the outdoor experiment.
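
The CPRP computation can be sketched from per-node reception logs
as follows. This is a minimal illustration, not the authors'
measurement code: the function names, the representation of a log
as a set of received sequence numbers, and the synthetic data
(chosen only to mirror the 37-out-of-38 example for N31 and N27)
are assumptions.

```python
from itertools import combinations


def cprp(received_high, received_low):
    """P_S(N_h | N_l): fraction of the packets received by the low-PRR
    node that were also received by the high-PRR node."""
    if not received_low:
        raise ValueError("low-PRR node received no packets; CPRP is undefined")
    return len(received_high & received_low) / len(received_low)


def pairwise_cprp(logs):
    """CPRP for every unordered pair of nodes with non-zero PRR.

    logs: dict node -> set of sequence numbers received from one sender.
    Returns a dict keyed by (high_prr_node, low_prr_node).
    """
    active = {node: seqs for node, seqs in logs.items() if seqs}
    result = {}
    for a, b in combinations(sorted(active), 2):
        # The node that received more packets plays the role of N_h.
        high, low = (a, b) if len(active[a]) >= len(active[b]) else (b, a)
        result[(high, low)] = cprp(active[high], active[low])
    return result


# Synthetic logs: N31 received 38 packets, 37 of which N27 also received.
n31 = set(range(38))
n27 = set(range(1, 390))
print(round(cprp(n27, n31), 3))  # 0.974

# 32 nodes with non-zero PRR give 32 * 31 / 2 = 496 pairs.
print(len(pairwise_cprp({i: {0, i + 1} for i in range(32)})))  # 496
```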

Figures 2(a) and 2(b) show that the values of the conditional
packet reception probability P_S(N_h|N_l) are
concentrated close to 100%. This result verifies our observation
that if a packet is received by a low PRR node, this
packet has a high probability of also being received by a
high PRR node. Due to physical constraints, we only
evaluated the link correlation using the MICAz platform.
Background traffic or interference would also cause
link correlation on other radio platforms [32].


2.3 Spatial Diversity in PRR 


Besides the link correlation, we also confirmed that
the packet reception ratios (PRR) of the receivers had a
diverse spatial distribution [4]. Figure 3 shows the spatial
distribution of PRR in the indoor Fixed Single Sender
experiment. The centers of “•” and “×” indicate the nodes’
locations. The larger the size of a “•”, the higher the PRR
value of this node, while “×” represents nodes that do not
receive any packet. The numbers underneath the nodes’
locations are Packet Reception Ratio (PRR) values.

Our Fixed Single Sender experiments show that even
when two receivers are physically very close to each
other, these receivers may have totally different PRRs.
For example, in Figure 3, although N36 and N39 are
located near each other at the upper-right corner, their
PRRs are significantly different, 81.55% and 2.64%,
respectively. There are many node pairs with such features,
such as N34 and N37, N23 and N22. A similar phenomenon
also occurs in the outdoor experiment, such as
N19 and N10, shown in Figure 4.


Figure 4. PRR(%) for Each Node (Outdoor)


2.4 Opportunity and Challenges on Flooding 


While these observations would impact many protocol
designs, this paper focuses on the flooding protocol
design in particular.


Link Correlation: Existing flooding protocols did not
take advantage of this correlated reception feature. As a
result, direct ACKs from receivers are normally used when
high reliability is desired. In other words, every receiver
needs to send ACKs in response to the reception of a
packet, leading to high communication overhead (when
explicit ACKs are used) or high redundancy in rebroadcasting
(when implicit ACKs are used). The research
challenge here is how to exploit link correlation, so that
the overhead of ACKs is reduced.

PRR Spatial Diversity: Section 2.3 shows that within
the radio range of a sender, even when receivers are located
close to each other, they may have dramatically different
PRRs due to environmental effects such as multi-path
fading. If a flooding protocol selects a fixed forwarder,
this forwarder has to retransmit a large number
of times to accommodate the receivers with low PRRs,
introducing excessive duplicated reception for those receivers
with high PRRs. The challenge here is how to
reduce the impact of spatial diversity, so that the overhead
of redundant transmissions is reduced.


Figure 5. Collective ACKs

In the rest of the paper, we present two corresponding
mechanisms to deal with these two challenges respectively.
We explain the ideas conceptually first in Section 3,
followed by detailed design in Section 4.


3 Key Mechanisms in Collective Flooding 


The main objective of collective flooding (CF) is to
reduce redundant transmissions inside the network while
providing reliable message dissemination. In CF, we
call a node a covered node if it has already received the
broadcasting packet. Covered nodes are responsible for
rebroadcasting the packet to uncovered nodes in the network.
In our design, rebroadcasting is used as an implicit
ACK to the sender to save protocol overhead. We
note that CF can also be applied when explicit ACKs are
used. Specifically, there are two key mechanisms in the
CF protocol:

• Collective ACKs: In CF, the overhearing of a node’s
rebroadcasting not only indicates that this node has received
the packet, but also serves as a collective ACK of
reception for some other neighboring nodes.

• Dynamic Forwarder Selection: The forwarder is selected
dynamically through competition among nodes
that have already received the broadcasting packet.


3.1 Benefit of Collective ACKs 


The mechanism of collective ACKs allows a node
to extract information about the status of its neighboring
nodes via receiving or overhearing a packet from
its neighbors. For example, in Figure 5, suppose that
node S is a covered node while N1 and N2 are uncovered.
They are within 1-hop communication range of
each other, where N1 is a low PRR receiver of S and
N2 is a high PRR receiver of S. When S broadcasts, if
N1 receives the packet, in traditional flooding protocols
without considering the correlation, N1 only knows that
S is covered, but still considers N2 as uncovered until N1
overhears N2’s rebroadcasting.

CF takes a different approach. From N1’s viewpoint,
a packet from S serves two purposes. First, it is a direct
ACK that S is a covered node. Second, it also serves as
a collective ACK to N1 that N2 has a reception probability
of P_S(N2|N1). Similarly, from S’s viewpoint, if S
later overhears the rebroadcasting (i.e., an implicit ACK)
from N1, S not only gets a direct ACK that N1 is covered,
but is also able to compute the coverage probability
of N2 according to the link correlation metric P_N1(N2|S).
We note that in traditional designs, overhearing a packet
serves only as a direct ACK that the packet sender is covered.
In CF, the ACK is achieved in a collective manner,
i.e., overhearing a packet serves as both direct and correlated
ACKs from the packet sender and its neighbors.


Figure 6. Example of Collective ACKs


Collective ACKs can greatly reduce the redundant
transmission. For the sake of clarity, let us consider the
simplified network shown in Figure 6. The link qualities
from node S to N1 and N2 are 25% and 10%, respectively;
the link qualities from N1 and N2 to S are
10% and 100%, respectively. We assume the CPRP
P_S(N1|N2) = 100%, which means that if N2 receives a
packet from S, N1 also receives that packet.

In traditional flooding protocols, the sender S treats
the receivers’ packet receptions as independent. To provide
reliable broadcasting, S needs to keep on transmitting
until it receives ACKs or overhears the transmissions
from both N1 and N2. Due to the low link quality from
N1 back to S (10%), S might conduct many redundant
retransmissions. In contrast, collective ACKs in CF allow
node S to terminate the transmission earlier if N2
receives the flooding packet with a smaller number of
retransmissions than expected. For example, if N2 receives
the packet at the first attempt (luckily) and rebroadcasts,
node S can immediately terminate the retransmission
to N1, based on the assumption P_S(N1|N2) = 100%.
Therefore, in this case, the number of transmissions at
node S can be reduced to one. As we can see from the
above simplified example, collective ACKs can improve
the efficiency of the reliable flooding protocol by utilizing
the link correlation.
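

The bookkeeping at sender S in this example can be sketched as
follows. This is a hedged illustration only: the class name
SenderView, the 0.99 reliability threshold, and the accumulation
rule cp = 1 - (1 - cp) * (1 - p) are assumptions made for this
sketch, since the text above states only that coverage
probabilities are accumulated and compared against a user-desired
threshold.

```python
class SenderView:
    """Coverage probabilities that one sender keeps for its 1-hop neighbors."""

    def __init__(self, neighbors, threshold=0.99):
        # threshold is a hypothetical user-desired reliability level.
        self.coverage = {n: 0.0 for n in neighbors}
        self.threshold = threshold

    def _accumulate(self, node, p):
        # Assumed rule: combine evidence as if each piece independently
        # covers the node with probability p.
        self.coverage[node] = 1.0 - (1.0 - self.coverage[node]) * (1.0 - p)

    def on_transmit(self, link_quality):
        """The sender transmitted once; link_quality maps neighbor -> PRR."""
        for node, q in link_quality.items():
            if node in self.coverage:
                self._accumulate(node, q)

    def on_overhear(self, rebroadcaster, cprp_given_rebroadcaster):
        """The sender overheard a rebroadcast (an implicit ACK).

        cprp_given_rebroadcaster maps neighbor -> P_S(neighbor | rebroadcaster).
        """
        self.coverage[rebroadcaster] = 1.0  # direct ACK
        for node, p in cprp_given_rebroadcaster.items():
            if node in self.coverage and node != rebroadcaster:
                self._accumulate(node, p)  # collective ACK

    def uncovered(self):
        return [n for n, cp in self.coverage.items() if cp < self.threshold]


# Figure 6 setting: S -> N1 is 25%, S -> N2 is 10%, P_S(N1|N2) = 100%.
view = SenderView(["N1", "N2"])
view.on_transmit({"N1": 0.25, "N2": 0.10})
view.on_overhear("N2", {"N1": 1.0})
print(view.uncovered())  # []
```

With the CPRP entry omitted (the link independence assumption), N1
would still be listed as uncovered after the same two events, and S
would have to keep retransmitting.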


3.2 Benefit of Dynamic Forwarder Selection 


As discussed in Section 2.4, a fixed-forwarder scheme
has to accommodate the receiver with the lowest reception
ratio, leading to high redundancy for the nodes with
high reception ratios. To address this problem, in CF the
covered nodes compete to become the forwarder node
based on their transmission effectiveness, which is defined
in Section 4.4 and calculated according to three factors:
(i) neighborhood size, (ii) link quality, and (iii) coverage
probability based on link correlation metric CPRP.
A node’s transmission is considered more effective if the
node has more uncovered neighbors and good link qualities
to them. This node wins the competition and rebroadcasts
with the shortest back-off time. The node’s
transmission would change the transmission effectiveness
value of this node and its neighboring nodes. Based
on the transmission effectiveness, the forwarder is dynamically
selected. This design avoids the same node
being selected as the forwarder all the time.

To illustrate the benefit further, we give an example
to demonstrate the process of the dynamic forwarder selection.
Again, let us consider a simplified scenario as in
Figure 7. The link quality from source node (S) to N2
is 25%. All the other link qualities are 100%. N3 and
N4 are 2 hops away from S. In order to minimize the total
number of transmissions, traditional approaches, such
as [18, 12], intend to select a fixed forwarder among
neighboring nodes according to their uncovered neighbor
size, in other words, capability of covering more uncovered
nodes. For example, in Figure 7, S selects N2
as a dedicated forwarder to rebroadcast the packet. The
reason is that N2 has more uncovered neighbors (N3 and
N4) than N1, which only has one uncovered neighbor
(N3). However, due to the unreliable link between S and
N2, a packet needs to be transmitted 4 times on average
from S before it is received by N2. Then, another
transmission is needed by N2 to cover N3 and N4. In total,
an average of 5 transmissions are needed for a single
network-wide broadcast.

In contrast, CF adopts a dynamic and opportunistic
approach. After S broadcasts, S, N1, and N2 compete to
be a forwarding node instead of using a dedicated forwarder.
Based on the actual reception status, there are
two cases:

Case 1: If N2 receives the packet (luckily) at the first attempt,
N2 can tell that N1 is covered based on CPRP
P_S(N1|N2) = 100%. After marking N1 as a covered
node, N2 still has more uncovered neighbors (N3 and
N4) than S and N1, and thus N2 wins the forwarder selection
competition and rebroadcasts. After N2’s rebroadcasting,
all the nodes update their neighbors’ coverage
probabilities and find that all their neighbors are covered.
Therefore, the competition stops and no more transmissions
are needed. The total number of transmissions for
the network is 2.

Case 2: If N2 does not receive the packet, a competition
occurs between S and N1. N1 is expected to win because
it has both N2 and N3 as uncovered nodes, while S
has only N2 as an uncovered node. After N1's broadcast,
N2 receives the packet and wins the competition
because it is the only node that has the uncovered neighbor
N4. One more transmission is needed from N2 to cover
N4. Therefore, the total number of transmissions for the
network is 3.

Figure 8. State Machine Diagram of CF

By introducing competition among the covered nodes,
CF reduces redundant transmissions. In the above
example, even in the worst case (Case 2), CF needs only 3
transmissions, fewer than the 5 transmissions needed on
average by the traditional fixed-forwarder approaches.
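The arithmetic behind this comparison can be checked with a short sketch. It assumes that packet losses on the 25% link are independent across attempts, so the number of attempts until success is geometric; the function name and the expected CF cost (weighting Case 1 by 0.25 and Case 2 by 0.75) are illustrative additions, not figures stated in the text.

```python
def expected_tx_until_success(link_quality: float) -> float:
    """Mean number of independent attempts until one is received (geometric)."""
    return 1.0 / link_quality


# Fixed forwarder: S repeats until N2 hears it over the 25% link,
# then N2 rebroadcasts once to cover N3 and N4.
fixed_forwarder = expected_tx_until_success(0.25) + 1

# CF: Case 1 (N2 hears S's first broadcast, probability 0.25) costs 2
# transmissions; Case 2 (probability 0.75) costs 3 transmissions.
p_case1 = 0.25
cf_expected = p_case1 * 2 + (1 - p_case1) * 3

print(fixed_forwarder)  # 5.0
print(cf_expected)      # 2.75
```

Under these assumptions, CF needs between 2 and 3 transmissions in this topology, against an average of 5 for the fixed forwarder.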


4 The Collective Flooding Protocol 


This section presents the main design of CF, which
is a simple finite state machine. As shown in Figure 8, a
node running CF is in one of three states at any time: (i)
maintenance, (ii) receiver, and (iii) sender. Transitions
between the states are triggered by events.

After the CF protocol is initiated, the node enters the
maintenance state, in which all of its 1-hop neighbor
information is periodically maintained. Here, two nodes
are considered neighbors if the link quality between
them is larger than 0%. Whenever the node receives a
broadcasting data packet, the node enters the receiver
state and uses this packet as a collective ACK to update
its neighbors' coverage probabilities. If the node has
uncovered neighbors, it sets its back-off timer based on its
transmission effectiveness, then goes back to the
maintenance state. When the node's back-off timer fires, which
means it wins the competition, it enters the sender state,
in which it sends out the packet and updates its
neighbors' coverage status; after that, it goes back to the
maintenance state. This procedure repeats until the node
estimates that all its neighbors are covered. In the rest of this
section, we explain the operations in each state in detail.


4.1 Maintenance State 

Wireless links in sensor networks are known to be
dynamic. Therefore, maintenance is needed to keep track
of the link quality. In CF, every node periodically sends
out a hello message at an adaptive time interval T, which
is increased or decreased based on the link's stability.
Every hello message is identified by the node ID and
a packet sequence number. The hello message is used
not only for 1-hop neighbor discovery, but also for
updating the link qualities and calculating the Conditional
Packet Reception Probability (CPRP) among neighboring
nodes. While the link quality calculation is
straightforward, the calculation of CPRP deserves a little more
explanation. Every node maintains a reception record of
all hello messages from its neighboring nodes within a
time window wT. In order to reduce the required memory
space and mitigate the overhead of control messages,
the record is represented in a bitmap format (e.g., [0110])
for each neighbor. Such records are exchanged within
a hello message every wT seconds among neighboring
nodes. CPRP is calculated as follows:

$$P_v(k|u) = \frac{\sum_{i=1}^{w} B_{vk}(i) \,\&\, B_{vu}(i)}{\sum_{i=1}^{w} B_{vu}(i)} \qquad (1)$$

Here v is the sender; k and u are the two receivers. B_vk(i)
is a bit representing node k's reception status of the ith
hello message sent from node v. B_vk(i) = 1 if k receives
this message; otherwise B_vk(i) = 0. For example, in
Figure 9, a bitmap of [1110]_k from node k indicates that k
does not receive node v's 4th transmission. When node
u receives this bitmap, it can use Equation 1 to calculate
the CPRP by performing the bit-wise AND operation with
its own bitmap ([0110]_u). In this example, the CPRP is
calculated as P_v(k|u) = (1&0 + 1&1 + 1&1 + 0&0) / (0 + 1 + 1 + 0) = 100%. We
note that the length of the bitmap w strikes a balance
between the control overhead and the statistical confidence of
the CPRP value.
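As a minimal sketch of Equation 1, the following Python function computes the CPRP from two reception bitmaps. The function name and the list-of-bits representation are illustrative assumptions, as is returning 0.0 when u received none of the hello messages (the text does not define that case).

```python
def cprp(bitmap_k: list[int], bitmap_u: list[int]) -> float:
    """Estimate P_v(k|u): the fraction of v's hello messages received by u
    that were also received by k (Equation 1)."""
    if len(bitmap_k) != len(bitmap_u):
        raise ValueError("bitmaps must cover the same window of w messages")
    received_by_u = sum(bitmap_u)
    if received_by_u == 0:
        # Assumption: with no receptions at u, there is no evidence to condition on.
        return 0.0
    received_by_both = sum(bk & bu for bk, bu in zip(bitmap_k, bitmap_u))
    return received_by_both / received_by_u


# The example from the text: k reports [1110], u holds [0110].
print(cprp([1, 1, 1, 0], [0, 1, 1, 0]))  # 1.0
```

A longer window w gives more bits per estimate at the cost of larger hello messages, which is the trade-off noted above.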


4.2 Receiver State 

A node enters the receiver state once it receives or
overhears a broadcasting packet. Nodes in the receiver
state compete to be selected as a forwarder. Without loss
of generality, suppose node u is the receiver of sender v.
Node u maintains two pieces of information:

• Coverage Probability: This is the probability of a
neighboring node's being covered in a broadcast from
the viewpoint of a node. For example, CP_u(k) is node
u's estimated probability that u's neighbor k has received
the broadcasting packet. Node u maintains CP_u(k) for all
its 1-hop neighboring nodes k ∈ N(u), where N(u) is u's
neighboring node set.

• Estimated Uncovered Node Set U(u): Here U(u) ⊆
N(u). Initially, node u considers all of its 1-hop
neighbors as uncovered. So for any node k ∈ N(u), CP_u(k) = 0,
and u's uncovered node set is U(u) = N(u).

Suppose node u receives a broadcasting packet M
from its neighboring sender v. This packet serves two
purposes. First, since the packet M is received from node v,
u updates the coverage probability of v as CP_u(v) = 1,
meaning that u is sure that v has already received the
packet (note that this is actually a direct implicit ACK).
Second, the packet also serves as a collective ACK for
all other neighbors k ∈ N(u). Based on the conditional
packet reception probability P_v(k|u) stored in its
neighbor table, the coverage probability of other nodes
k ∈ N(u) - {v} is updated as follows:

$$CP_u(k) \leftarrow 1 - (1 - CP_u(k)) \cdot (1 - P_v(k|u)) \qquad (2)$$

where the term (1 - CP_u(k)) is the probability that k had
not received the packet M before v's forwarding; the term
(1 - P_v(k|u)) is the probability of k's failure to receive
M from v given the condition that u received M. So
1 - (1 - CP_u(k)) · (1 - P_v(k|u)) is the probability of node
k's being covered either by (i) a previous transmission in
the network or (ii) the current forwarding from v. We note
that CP_u(k) is the coverage probability estimated by node
u. Formula 2 utilizes P_v(k|u) to accumulate node u's
confidence in treating k as a covered node. Namely, u's
receiving from v also serves as a collective ACK for k. In
the worst case, when there is no link correlation, the
conditional packet reception probability of a node will be
equal to the link quality (i.e., P_v(k|u) = P_v(k)). In this
case, our flooding protocol uses the link quality
information (i.e., P_v(k)) to update the coverage probability via
Formula 2.
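The receiver-side update can be sketched as follows. The dictionary-based neighbor table, the function name, and the default of 0.0 for a missing CPRP entry are assumptions made for illustration only.

```python
def receiver_update(cp: dict[str, float], sender: str,
                    cprp_from_sender: dict[str, float]) -> None:
    """Update u's coverage probabilities after receiving packet M from `sender`.

    cp maps each neighbor k of u to CP_u(k); cprp_from_sender maps k to P_v(k|u).
    """
    cp[sender] = 1.0  # direct implicit ACK: the sender certainly has the packet
    for k in cp:
        if k == sender:
            continue
        p = cprp_from_sender.get(k, 0.0)  # assumption: no record means no evidence
        cp[k] = 1.0 - (1.0 - cp[k]) * (1.0 - p)  # Formula 2


# Hypothetical neighbor table of u with neighbors v, k1, and k2.
cp_u = {"v": 0.0, "k1": 0.0, "k2": 0.5}
receiver_update(cp_u, "v", {"k1": 0.8, "k2": 0.6})
print(cp_u)
```

Here k1 moves from 0 to about 0.8, and k2, which was already half covered, moves to about 0.8 as well, showing how evidence accumulates across overheard transmissions.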

When the coverage probability CP_u(k) reaches a user's
pre-specified threshold α < 1, node k is considered by
node u as covered and is removed from node u's
uncovered node set U(u). If U(u) is not empty, node u joins the
competition for being the next local forwarder by setting
its back-off timer according to its transmission
effectiveness, which is detailed later in Section 4.4. If U(u) is
empty, node u exits the receiver state and completes its
broadcasting mission, as shown in Figure 8.

We note that before the timer expires, if node u
overhears the broadcast packet M again from one of its
neighbors, u cancels the currently running timer, repeats the
above coverage probability updating procedure, and
resets its timer. If u's timer fires before all other
competitors', u is selected as the forwarder. It enters the sender
state to send out broadcast packet M and perform related
updates, as explained in the next subsection.


4.3 Sender State 


By winning the forwarder competition, node u enters
the sender state and sends out the packet. Then, it
updates the coverage probabilities of its uncovered
neighbors k ∈ U(u) with

$$CP_u(k) \leftarrow 1 - (1 - CP_u(k)) \cdot (1 - L(u,k)) \qquad (3)$$

Here L(u,k) is the link quality between u and k, so the
term (1 - L(u,k)) indicates the probability of k's
failure to receive the broadcasting packet from u. From u's
point of view, k has a probability of (1 - CP_u(k)) of being
uncovered before u's forwarding, and therefore the term
1 - (1 - CP_u(k)) · (1 - L(u,k)) shows the probability of
the event that node k either was covered previously or is
covered by u's current forwarding.

As in the receiver state, when the coverage probability
CP_u(k) of node k reaches a user-specified threshold
α, node k is considered covered and is removed from the
uncovered node set U(u). If U(u) is empty, node u
terminates the flooding task; otherwise, node u joins the
forwarder competition again by setting its back-off timer
and returning to the maintenance state. We note that the
value of α is used as a threshold to terminate the
retransmission at each node. Different α values can achieve
different reliabilities for different applications.
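To make the role of α concrete, the sketch below applies Formula 3 repeatedly for a single neighbor and counts how many transmissions u needs before it considers that neighbor covered. It isolates the sender-side update (no overheard packets), uses hypothetical names, and treats "reaches the threshold" as `>=`, which is an assumption.

```python
def sender_update(cp: dict[str, float], uncovered: set[str],
                  link_quality: dict[str, float], alpha: float) -> bool:
    """Apply Formula 3 after u transmits once; return True when U(u) is empty."""
    for k in list(uncovered):
        cp[k] = 1.0 - (1.0 - cp[k]) * (1.0 - link_quality[k])
        if cp[k] >= alpha:  # assumption: "reaches" is read as >=
            uncovered.discard(k)
    return len(uncovered) == 0


# One neighbor k behind a 50% link, threshold alpha = 0.9 (illustrative values).
cp = {"k": 0.0}
uncovered = {"k"}
link_quality = {"k": 0.5}
transmissions = 0
done = False
while not done:
    transmissions += 1
    done = sender_update(cp, uncovered, link_quality, alpha=0.9)

print(transmissions, cp["k"])  # 4 0.9375
```

Raising α toward 1 increases the number of retransmissions a node is willing to spend on a weak link, which is how different α values trade overhead for reliability.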


4.4 Back-off Timer Design 


The back-off timer is used to conduct dynamic
forwarder selection. In the forwarder competition, the
duration of the back-off timer is carefully set according to a
combination of factors, including (i) neighborhood size,
(ii) link quality, and (iii) neighbors' CPRPs. Intuitively,
if a node has more uncovered neighbors with good link
quality, this node should be the next forwarder and thus
should have a short duration before the timer fires.

In CF, we define Transmission Effectiveness (TE) as a 
reference metric for setting the back-off time period. 


Definition: TE(u) equals the number of uncovered 
nodes that are expected to be covered if the sender u 
transmits once. 


In general, the value of TE for node u can be
calculated with the following equation:

$$TE(u) = \sum_{k \in U(u)} L(u,k) \cdot (1 - CP_u(k)) \qquad (4)$$

The meaning of Equation 4 is straightforward. The
transmission effectiveness of u equals the summation of
the probabilities of covering u's uncovered neighbors,
namely the expected number of nodes to be covered by
u's forwarding. The higher the TE(u) value, the more
effective node u's transmission is. For example, in a perfect
link (L(u,k) = 100%) scenario, if node u has 2
uncovered nodes, the TE value of u is TE(u) = 1 × 1 + 1 × 1 =
2, meaning that if u transmits once, 2 neighbors get
covered. In another example, if the link qualities from u
to these uncovered nodes are all 50%, the TE value of
u is TE(u) = 0.5 × 1 + 0.5 × 1 = 1, meaning that if u
transmits once, one neighbor is expected to get covered.
Since node u updates the value of CP_u(k) whenever it
sends or receives a broadcast packet from its neighbors,
TE(u) changes dynamically during the dissemination of
the packet.

Every node continuously updates its TE value and sets 
the back-off timer based on the TE value. Intuitively, the 
higher the TE value, the smaller the back-off time period 
should be. The rationale behind this is that we always 
select the forwarder which is able to cover more nodes in 
the network with one transmission. 
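Equation 4 and the inverse relation between TE and back-off time can be sketched as below. The TE computation follows the equation directly; the `backoff_time` mapping is purely hypothetical (the text only requires that a higher TE yields a shorter back-off), and `max_backoff` is an assumed parameter.

```python
def transmission_effectiveness(link_quality: dict[str, float],
                               cp: dict[str, float],
                               uncovered: set[str]) -> float:
    """Equation 4: expected number of uncovered neighbors covered by one transmission."""
    return sum(link_quality[k] * (1.0 - cp[k]) for k in uncovered)


def backoff_time(te: float, max_backoff: float = 0.1) -> float:
    """Hypothetical mapping: strictly decreasing in TE, at most max_backoff seconds."""
    return max_backoff / (1.0 + te)


uncovered = {"a", "b"}
no_coverage = {"a": 0.0, "b": 0.0}

# The two examples from the text: perfect links, then 50% links.
te_perfect = transmission_effectiveness({"a": 1.0, "b": 1.0}, no_coverage, uncovered)
te_half = transmission_effectiveness({"a": 0.5, "b": 0.5}, no_coverage, uncovered)

print(te_perfect, te_half)  # 2.0 1.0
print(backoff_time(te_perfect) < backoff_time(te_half))  # True
```

Any strictly decreasing mapping would preserve the ordering of competitors; the specific form used here is only one convenient choice.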


4.5 The Detailed Protocol 


Combining all the design components, CF can be 
specified by the pseudo code shown in Protocol 1. 

The design is simple and requires only 1-hop
information. Each node maintains a state machine with three
states. A state transition is triggered by the event of either
receiving a broadcast packet (Line 4) or the sending timer
firing (Line 12). Lines 4 to 11 handle the event of
receiving a packet. Lines 12 to 18 handle the timer fired
event by sending out the packet and updating the coverage
probability values of neighboring nodes. Lines 22 to
25 update the uncovered set of a node, and Lines 26 to 29
determine whether the flooding task has been finished.


Protocol 1: Collective Flooding

 1  Initially, U(u) ← N(u); ∀k ∈ U(u), CP_u(k) ← 0;
 2  repeat
 3      switch Event do
 4          case u receives packet from v
 5              for k ∈ U(u) do
 6                  if k = v then CP_u(k) ← 1;
 7                  else Update CP_u(k) via Formula 2;
 8                  end
 9                  Call Update U(u);
10              end
11              Call Test U(u);
12          case timer fired
13              u sends out the packet;
14              for k ∈ U(u) do
15                  Update CP_u(k) via Formula 3;
16                  Call Update U(u);
17              end
18              Call Test U(u);
19          end
20      end
21  until U(u) = ∅;

22  Update U(u) method:
23      if CP_u(k) > α then
24          U(u) ← U(u) - {k};
25      end

26  Test U(u) method:
27      if U(u) ≠ ∅ then Set back-off timer;
28      else Terminate the timer;
29      end
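Protocol 1 can be mirrored in a few lines of Python. The sketch below is illustrative only: the class name, the dictionary-based tables, and the `timer_armed` flag (standing in for a real back-off timer) are assumptions, and radio transmission is omitted. The threshold test uses `>=`, reading "reaches the threshold" inclusively.

```python
class CFNode:
    """Event-driven sketch of one node u running Collective Flooding."""

    def __init__(self, link_quality, cprp, alpha=0.99):
        self.link_quality = link_quality          # k -> L(u, k)
        self.cprp = cprp                          # (v, k) -> P_v(k|u)
        self.alpha = alpha                        # coverage threshold
        self.cp = {k: 0.0 for k in link_quality}  # k -> CP_u(k), initially 0
        self.uncovered = set(link_quality)        # U(u) = N(u) initially
        self.timer_armed = False

    def _update_uncovered(self, k):
        # "Update U(u)": drop k once its coverage probability reaches alpha.
        if self.cp[k] >= self.alpha:
            self.uncovered.discard(k)

    def _test_uncovered(self):
        # "Test U(u)": arm the back-off timer only while neighbors remain uncovered.
        self.timer_armed = bool(self.uncovered)

    def on_receive(self, v):
        # Packet received from v: direct ACK for v, collective ACK for the rest.
        for k in list(self.uncovered):
            if k == v:
                self.cp[k] = 1.0
            else:
                p = self.cprp.get((v, k), 0.0)
                self.cp[k] = 1.0 - (1.0 - self.cp[k]) * (1.0 - p)  # Formula 2
            self._update_uncovered(k)
        self._test_uncovered()

    def on_timer_fired(self):
        # u won the competition; the actual radio send is omitted in this sketch.
        for k in list(self.uncovered):
            lq = self.link_quality[k]
            self.cp[k] = 1.0 - (1.0 - self.cp[k]) * (1.0 - lq)     # Formula 3
            self._update_uncovered(k)
        self._test_uncovered()


# Hypothetical node with neighbors v, k1, k2; k1 is perfectly correlated with u.
u = CFNode({"v": 1.0, "k1": 1.0, "k2": 1.0}, {("v", "k1"): 1.0})
u.on_receive("v")
print(sorted(u.uncovered), u.timer_armed)  # ['k2'] True
u.on_timer_fired()
print(sorted(u.uncovered), u.timer_armed)  # [] False
```

In this run, one overheard packet covers v and k1 without any explicit ACK, and a single transmission by u over a perfect link covers k2 and ends the flooding task.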

In summary, the CF protocol has three efficient
features: (i) it can be implemented with a simple finite state
machine with 3 states, which is suitable for
resource-constrained sensor nodes; (ii) it deals with the spatial
diversity of packet reception with dynamic forwarder
selection; and (iii) it reduces the communication redundancy
through collective ACKs, eliminating costly direct ACKs
from every receiver.


5 Implementation and Evaluation 


We have implemented a complete version of CF on
the TinyOS [27]/MICAz platform in nesC [5]. The
following two protocols are also implemented as baselines:

• Standard Flooding (FLD): Every node rebroadcasts
the first-time received packet exactly once.

• Reliable Broadcast Propagation (RBP): RBP [34]
was proposed in SenSys'06. As in standard flooding, in
RBP, every node unconditionally rebroadcasts the
first-time received packet once. Then the node adjusts the
number of retries based on the neighborhood density. If
there exists a bottleneck link from the current node (N) to its
downstream node (D), node N performs up to the
maximum number of retries when it does not receive the ACK
from D.

Figure 10. The Performance of Single Hop Indoor Experiment

Four metrics are used to evaluate the protocols:

• Reliability: Reliability is quantified by the percentage
of nodes in a network that receive the flooding packets.

• Message Overhead: Message overhead is measured
by the total number of data packets transmitted during
the experiment period. We do not count hello messages
toward the overhead for two reasons. First, the overhead of
hello messages is highly environment-dependent. It can
be very high in extremely dynamic environments, and
very low in static environments. For example, in our
indoor experiment, the average hourly number of hello messages is
73 packets per node during the day and 17 packets per
node during the night. Second, most flooding designs
need hello messages for neighbor information
maintenance. We acknowledge that CF has extra overhead to
exchange bitmaps within a hello message every wT time
interval to calculate CPRP. However, this overhead is
independent of data traffic; hence the cost is amortized over
multiple flooding operations.

• Dissemination Delay: Dissemination delay is the
duration from the time the source initiates the
packet to the time either the last node receives the packet or no
more nodes resend the packet for a single flood.

• Load Balance: This is indicated by the standard
deviation of the number of packets transmitted per node per
flood. This metric measures how evenly the rebroadcasting
activities are distributed in the network.
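Three of these metrics reduce to simple statistics over per-flood logs, as the following sketch shows. The function names and sample data are hypothetical; the use of the population standard deviation and the choice of which nodes count toward the total are assumptions, since the text does not specify them.

```python
from statistics import pstdev


def reliability(nodes_received: int, total_nodes: int) -> float:
    """Percentage of nodes that received the flooded packet."""
    return 100.0 * nodes_received / total_nodes


def message_overhead(tx_per_node: list[int]) -> int:
    """Total number of data packets transmitted for one flood (hello messages excluded)."""
    return sum(tx_per_node)


def load_balance(tx_per_node: list[int]) -> float:
    """Standard deviation of per-node transmissions (population form assumed)."""
    return pstdev(tx_per_node)


# Hypothetical log for one flood in a 4-node network.
tx_log = [1, 0, 2, 1]
print(reliability(19, 20))       # 95.0
print(message_overhead(tx_log))  # 4
print(load_balance(tx_log))      # about 0.707
```

Dissemination delay is omitted here because it needs per-packet timestamps rather than counters.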


5.1 Experiment Setup 


During the experiments, we placed MICAz nodes in
indoor and outdoor environments and tuned the
transmission power to ensure multi-hop communication
between the source node and the other nodes. In the
experiments, after deployment all the nodes were
synchronized and started neighbor discovery by sending
out hello messages. After all the nodes obtained the link
quality and conditional packet reception ratio
information about their 1-hop neighbors, a sender was selected
to send out 100 data packets with a time interval of 10
seconds. For performance analysis purposes, in each
data packet we included information such as hop count,
time stamp, and the previous hop's node ID. Upon
receiving the data packet, the intermediate node recorded
this information in its flash memory. Every node also
recorded the number of transmissions it conducted for
each data packet, which was identified by the sequence
number. For all the protocols, we maintained the same
network placement during the experiments. Unless
explicitly stated otherwise, we used the above default
values in all the experiments.


5.2 Single Hop Indoor Experiments 


The indoor scenario represents potential applications
including facility management [38], data center sensing
[16], and structural monitoring [23]. In this experiment,
one MICAz node was placed as a sender in the center of
the indoor 7.5m x 2.5m testbed, and the other 19 MICAz
nodes were randomly deployed around the sender. The
transmission power was tuned to ensure that all of these
19 nodes were within the sender's transmission range,
although not necessarily within each other's transmission
range. The link qualities (i.e., packet reception ratios of
the receivers) from the sender to these 19 nodes varied
between 7% and 100%.


As shown in Figure 10(a), due to the unreliable
wireless links, standard flooding (FLD) has only 79.9%
reliability. CF, however, achieves the same 100% reliability
as RBP. This is because in CF, based on the link quality
and strong conditional packet reception ratio
information, every node can accurately estimate whether all
its 1-hop neighbors have received the packet. This accurate
estimation also results in a lower number of transmissions
inside the network. Figure 10(b) compares the total
number of packets transmitted for every data packet initiated
from the sender when running the RBP, standard flooding
(FLD), and CF protocols. The average values of the
total number of packets transmitted for RBP, FLD, and CF
are 34.64, 15.98, and 19.5, respectively. Compared with
RBP, CF reduces the total number of packets
transmitted by 43.7%, while maintaining the same 100%
reliability. Due to the low reliability of FLD, 20.1% of nodes
do not receive the packet, and thus FLD has the smallest
total number of transmissions.

Figure 12. The Performance of Multi-hop Indoor Experiment

Figure 13. Outdoor Experiment Site: On a Bridge

Figure 14. The Performance of Outdoor Linear Network Experiment


Figure 10(c) compares the dissemination delays of the
three protocols for all 100 data packets initiated by the
sender. The average delay for CF is 0.571s, for RBP
is 1.09s, and for FLD is 0.69s. Since in CF the node
with the largest transmission effectiveness has the
smallest back-off time, CF's dissemination delay is 47.6% less
than RBP's. As shown in Figure 11, the standard
deviations of CF and RBP are very close, but the average
numbers of transmissions per node are different (1.73
for RBP, 0.975 for CF). The reason is that in RBP every
node unconditionally rebroadcasts the first-time received
packet once. In this way, RBP has a larger number of
transmissions and a smaller standard deviation than CF.
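The percentages reported for the single-hop experiment follow directly from the averages given above, as this small check shows. The helper name is illustrative, and dividing the totals by 20 nodes (the sender plus 19 receivers) is an inference from the experiment description rather than a stated step.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Total packets per flood: RBP 34.64 vs. CF 19.5.
print(round(percent_reduction(34.64, 19.5), 1))  # 43.7

# Average dissemination delay: RBP 1.09 s vs. CF 0.571 s.
print(round(percent_reduction(1.09, 0.571), 1))  # 47.6

# Average transmissions per node, assuming 20 nodes in total.
print(round(34.64 / 20, 2), round(19.5 / 20, 3))  # 1.73 0.975
```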


5.3 Multi-hop Indoor Experiments


To further investigate the performance of CF in a
larger-scale and denser network, a multi-hop indoor
experiment was conducted. In this experiment, one
MICAz node was placed as the sender at the bottom-left
boundary of the indoor 7.5m x 2.5m testbed, and the
other 36 MICAz nodes were randomly deployed on the
testbed. Figure 12(a) shows that the reliability of CF,
standard flooding (FLD), and RBP is 99.97%, 93.25%,
and 99.89%, respectively. CF achieves essentially the same
reliability as RBP, but CF reduces the number of transmissions
by 50.7%, as shown in Figure 12(b). The reduction is
larger than the 43.7% reduction in the number of
transmissions in the sparser network (discussed in Section 5.2).
This is because in the denser network, the node running
CF has more 1-hop neighbors, which further helps
the node accurately predict whether its 1-hop neighbors
have received the packet. CF also has fewer
transmissions than standard flooding. For instance, the
average values of the total number of packets transmitted by
RBP, standard flooding, and CF are 62.94, 33.21,
and 31.04, respectively. CF transmits fewer packets,
which translates to lower energy consumption.
In addition, CF does not prevent
MAC layer energy management such as low power
listening (LPL) [28] and SCP-MAC [41]. For example,
in LPL, nodes briefly wake up to check channel
activity without actually receiving data. If the channel is not
idle, the node stays awake to receive data. Otherwise it
immediately goes back to sleep. In this way, LPL
protocols consume much less energy than they would if they
listened for the full contention period.
Figure 16. Insight Analysis of Outdoor Experiments

Figure 12(c) compares the dissemination delay of
these three protocols. The average delay for RBP,
standard flooding, and CF is 1.73s, 1.41s, and 0.89s,
respectively. By relying on the node with the largest
transmission effectiveness to do the transmission first, the average
delay of CF is 48.5% less than that of RBP. The standard
deviation of CF, standard flooding, and RBP is 0.866,
0.288, and 0.669, respectively, as shown in Figure 11.
CF has a slightly higher standard deviation than RBP.
Compared with the sparser network (i.e., single hop
indoor), the standard deviation of CF decreases in a denser
network (i.e., multi-hop indoor). This is because in a
denser network, when running CF, the nodes can more
dynamically choose the path to propagate the packets.


5.4 Outdoor Linear Network Experiments 

The outdoor experiment represents potential
applications such as monitoring remote infrastructures or
environments [30, 37, 42]. In the experiment, 48 MICAz
nodes were deployed along a 326-meter-long bridge, as
shown in Figure 13. For a fair comparison, we
implemented two versions of RBP: RBP4 and RBP8. In
RBP4, we set the bidirectional link quality threshold for
two nodes to be considered neighboring nodes at 50%,
and set the maximum number of retries (used when an
insufficient number of neighbors got the packet or an important
neighbor did not get it) at 4. To improve the reliability,
in RBP8, the bidirectional link quality threshold was
reduced to 30% and the number of retries was set at 8. N1
was selected to be the sender that initiated the
broadcasting of the data packets.

As shown in Figure 14(a), the reliability of RBP8,
RBP4, CF, and standard flooding is 99.96%, 97.6%,
99.93%, and 61.96%, respectively. With the link quality
and conditional packet reception ratio information, CF
achieves essentially the same reliability as RBP8 and reduces the
number of packets transmitted by 31.2% (shown in
Figure 14(b)). The average values of the number of packets
transmitted for RBP8, RBP4, CF, and standard flooding
are 116.64, 91.54, 80.26, and 26.58, respectively.

NSDI '10: 7th USENIX Symposium on Networked Systems Design and Implementation

Due to the bottleneck links (further discussed in
Section 5.5.4), in RBP4, the data packet with sequence
number 7 is received by only 12 nodes in the network, which
results in a sharp drop in the total number of packets
transmitted in Figure 14(b). As shown in Figure 14(c),
the average dissemination delay of RBP8, RBP4, CF,
and standard flooding is 4.46s, 3.93s, 2.85s, and 2.34s,
respectively. The average delay of CF is 36% less than
that of RBP8. As shown in Figure 11, the standard
deviation of RBP8, RBP4, CF, and standard flooding is 1.9,
1.05, 1.6, and 0.5, respectively. Compared with the
indoor experiments, in the outdoor experiment each node
has fewer neighbors, resulting in more unbalanced
transmissions among these nodes, which explains why the
standard deviation of the outdoor experiment is larger.


5.5 System Insight Analysis

In the previous sections, we showed that CF has better
performance than standard flooding and RBP. In this
section, we explain why this is the case by revealing some
system insights.


5.5.1 Number of Neighbors 

Figures 15(a) and 16(a) compare the CDF of each
node's neighbor size when running CF and RBP for the
multi-hop indoor and outdoor experiments. In RBP, two
nodes are considered neighbors if and only if the
bidirectional link qualities between them are higher than a
threshold. In CF, by contrast, there is no constraint on the link
quality; thus nodes have more neighbors when running
CF than when running RBP. The maximum number of
neighbors that CF and RBP have is 17 and 9,
respectively, for the indoor experiment, and 11 and 7,
respectively, for the outdoor experiment. In CF, more
neighbors means that more information can be utilized by
the node to predict the coverage of its neighbors using
collective ACKs.


5.5.2 Prediction Accuracy 


In the previous section, we illustrated that CF has
more 1-hop neighbors than RBP. In this section, we show
that these 1-hop neighbors provide more information for
the node running CF, which can therefore more accurately
predict whether its neighbors receive the packet. Duplicated
transmissions happen when the sender does not realize
that the receiver has already received the packet and
retransmits the packet. Neither RBP nor standard flooding
utilizes the conditional packet reception ratio
information to predict the packet reception of neighboring nodes,
which results in a higher number of duplicated
transmissions. Figures 15(b) and 16(b) show the CDF curves of
the number of duplicated packets received for the same
sequence number of the data packet by all the nodes
running CF, standard flooding, and RBP, respectively. By
tracing the logged data, we also include an Oracle
solution, in which the node running CF, instead of doing
the coverage probability estimation, stops transmission
once its neighboring nodes receive the packet. From the
figures, we can see that CF has a smaller number of
duplicated packets received than RBP in both the outdoor
and indoor experiments. The Oracle and CF curves are
very close, which indicates that the node running CF can
accurately predict whether its neighbors have received
the packet. Due to the accurate prediction, CF achieves
the same reliability as RBP, while it has fewer duplicated
transmissions.


5.5.3 Efficiency in Delivery Paths 


To trace the experiment, hop count information was
attached to the data packet. Figures 15(c) and 16(c) show
the CDF of the number of hops the data packets
traveled in order to reach the node. In CF, the node with the
largest transmission effectiveness has the smallest
back-off timer. This back-off mechanism significantly reduces
the number of hops the data packets traveled. For
example, the maximum number of hops for CF and RBP
is 4 and 6, respectively, in the indoor experiment, and 9
and 13, respectively, in the outdoor experiment. Standard
flooding has a path length similar to RBP's, but in the
outdoor experiment the maximum number of nodes covered
by standard flooding was 41 out of 48. That is why the
FLD curve terminates in the upper-right corner of Figure
16(c). Due to the smaller number of hops traveled, CF
has a smaller dissemination delay than RBP.


5.5.4 Asymmetric and Bottleneck Links 


Figure 17 shows the link qualities between node N1
and all its neighboring nodes. In the figure, a cross
represents a node that does not have a direct link with N1;
a square represents a node that has a unidirectional link
with N1; and a round dot represents a node that has
bidirectional links with N1. We use N8(21/99) to represent
that the link quality from N1 to N8 is 21%, while the link
quality from N8 to N1 is 99%. Similar notation is used
for all the other nodes. From the figure, we find a large
number of asymmetric or even unidirectional links. If we
run RBP, N1 selects only three nodes (N2, N3, and N9)
as neighbors that have bidirectional link qualities higher
than the threshold (60%). We also note that some nodes,
such as N13, have shorter distances to N1 than N17 but
have no connection to N1.


A similar phenomenon also happens in the outdoor
experiment. Figure 18 shows the link qualities between
node N12 and some of its neighboring nodes. Due to
environmental effects, the link quality from N12 to N13
is only 23%, while the backward link quality from N13
to N12 is 87%. Moreover, although the physical distance
between N12 and N13 is shorter than the distance between
N12 and N15, the link quality from N12 to N13 is lower
than the link quality from N12 to N15, which is 55%.

Figure 19. Impact of Node Density

Since the bidirectional link qualities between node
N12 and nodes N13 and N14 are below the thresholds of
RBP4 and RBP8, which are 50% and 30%, respectively,
both RBP4 and RBP8 exclude N13 and N14 from N12's
neighbor table. This introduces two effects: (i) N13 and
N14 may not receive the packet from N12, leading to (ii)
a bottleneck link between N12 and N15, because in RBP,
the node only retries up to the maximum number of
retransmissions if it does not hear the downstream node's
ACK. For RBP4, the maximum number of
retransmissions is 4, meaning that in this specific topology, there
is still a 4.1% probability that N15 will not receive the
packet after N12's 4th retry. As shown in Figure 14(b),
the data packet with sequence number 7 could not be
received by N15; thus the packet stopped propagating in
the network.
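The 4.1% figure is consistent with four independent attempts over the 55% link from N12 to N15, as the sketch below illustrates. Independence between attempts and the reading of "4th retry" as four attempts in total are assumptions; the function name is illustrative.

```python
def miss_probability(link_quality: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent transmissions is lost."""
    return (1.0 - link_quality) ** attempts


# N12 -> N15 link quality is 55%; RBP4 allows up to 4 attempts on this link.
print(round(100.0 * miss_probability(0.55, 4), 1))  # 4.1

# With RBP8's limit of 8, the same link would fail far less often.
print(round(100.0 * miss_probability(0.55, 8), 2))  # 0.17
```

With 100 data packets per experiment, a 4.1% per-packet failure probability makes at least one stalled flood, such as the one observed for sequence number 7, quite likely.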

As discussed in Section 3.1, the CF protocol can
overcome the difficulties brought about by asymmetric links
through the information on link quality and conditional
packet receptions. In this scenario, as a receiver, N12 can
estimate the packet reception probability of N13 and N15
based on overhearing N14's transmission, because N14
has larger transmission effectiveness than N15 and
therefore transmits earlier than N15. Moreover, as a
sender, N12 can also estimate the packet reception
probabilities of N13 and N15 based on N12's own
transmission.


6 Simulation Evaluation 


In order to understand the performance of the
proposed CF scheme under numerous network settings, in
this section we provide extensive simulation results. We
compared the performance of CF with the following two
state-of-the-art solutions and an Oracle approach:

• Reliable Broadcast Propagation (RBP) [34] by F.
Stann et al. in SenSys'06.

• Double-Covered Broadcast (DCB) [18] by Wei Lou
and Jie Wu in INFOCOM'04. In DCB, every node
maintains 2-hop neighbor information. When a sender
broadcasts a packet, it greedily selects the forwarders from
its 1-hop neighbor set based on two criteria: (i) the
rebroadcasts by the forwarders cover all the sender's
2-hop neighbors, and (ii) the sender's 1-hop non-forwarder
neighbors need to be covered by at least two forwarders,
including the sender itself.

• Oracle: In addition to the state-of-the-art solutions,
we also include a theoretical "best-case" bound provided
by an Oracle. In the Oracle approach, we assume there
exists a perfect cost-free ACK in CF, so instead of
doing the coverage probability estimation, the node will
know exactly whether or not its neighbors have received the
packet. We note that the Oracle approach is not optimal,
but it serves as a good baseline.


6.1 Simulation Setup 

We simulated our design with ns. Our simulation 
MAC layer provided multiple access with collision 
avoidance. The MAC layer worked in the broadcast 
mode with no ACKs or retransmissions. The radio 
model was implemented based on our empirical data, 
which has the CPRP feature as described in Section 2. 

In the simulation, we randomly deployed 250 sensor 
nodes in a 200m x 200m square field. A source node was 
positioned near the boundary of the field, and the source 
sent out a data packet with a 29-byte payload every 10 
seconds. The total simulation time was set at 3200 seconds. 
To avoid the initialization bias of the system 
state on the broadcast operation, the source did not 
send out data packets in the first 100 seconds, and only 
hello messages were exchanged between neighboring nodes 
to establish the neighborhood information. Similarly, 
to make sure that all the broadcast packets propagated 
throughout the network, the source stopped sending out 
data packets after 3100 seconds. Every data point on a 
graph represents the averaged value of 10 runs, and 95% 
confidence intervals for the data are within 2~8% of the 
mean shown. Unless explicitly stated otherwise, we used 
the above default values in our simulation. 


6.2 Impact of Node Density 

In this experiment, we analyzed the effect of node 
density by varying the number of nodes in the field from 
50 nodes to 250 nodes. 

Figure 19(a) shows that the reliability of all the protocols 
increases as the network density increases. As 
the node density varies, CF maintains more than 99% 
reliability, while the mean reliability of DCB varies 
from 0.92 to 0.962. This is because DCB uses only two 
forwarding nodes to cover each non-forwarding node. If 
the transmissions from both forwarders fail, the 
non-forwarding node will not be covered. When the node 
density is low, RBP has lower reliability than CF. The 
reason is that in RBP two nodes are considered neighbors 
only when the link quality between them is larger 
than a set threshold (60% according to [34]). In a sparse 
network, some nodes in a sparse area may not be 
considered neighbors of any other node in the 
network. Nodes in sparse areas thus have a 
lower chance of receiving the packet from their 
neighbors when running RBP. In CF, however, each node 
estimates its neighbors’ packet receptions based on the link 
quality information. As long as these nodes are physically 
connected, CF can provide high reliability. 


Figure 19(b) shows that when the network density 
increases, the total number of packets transmitted increases 
linearly in RBP but only slightly in both DCB and 
CF. The reason is that in RBP, every node needs to 
retransmit the packet that it receives for the first time. 
DCB’s greedy algorithm is not optimal: it guarantees 
neither that the number of forwarders is the minimum nor 
that the non-forwarding nodes are covered by exactly two 
forwarders. In DCB, when the network density increases, 
the number of forwarders slightly increases, which results 
in an increase in the total number of packet transmissions. 
In CF, when the network density increases, every node needs 
to cover more neighbors, which results in the slight increase 
in the total number of packets transmitted. More 
neighboring nodes also help the node predict its neighbors’ 
coverage probability, which results in the decrease in the 
gap between CF and Oracle in Figure 19(b). 


Figure 19(c) shows that when the network density 
increases, the end-to-end delay decreases in CF and RBP 
but increases in DCB. This is because when the network 
density increases, the number of retransmissions used in 
CF and RBP decreases, while in DCB the number of 
forwarders slightly increases. Every forwarder needs to 
back off and retransmit the packet if it does not hear 
the retransmissions from its successors that are selected 
as forwarders. 


Figure 19(d) shows that when the network density 
increases, the standard deviation of the number of data 
packets transmitted per node first increases and then 
decreases for all the protocols. This is because when the 
node density increases, the network switches from 
forwarder-dominated to non-forwarder-dominated for all 
the protocols. DCB has the highest deviation because 
it always selects the forwarders to do the retransmission. 
In high-density networks, the number of 
transmissions needed by the nodes is balanced, but RBP 
has a slightly lower standard deviation than CF. This is 
because RBP requires every node to retransmit at 
least once, while in CF some nodes do not need to 
retransmit if collective ACKs provide sufficient evidence 
that their neighboring nodes are already covered. 


6.3 Impact of Lossy Links 


In this experiment, we analyze the effects of varying 
link qualities. The average link quality varies from 60% 
to 100%. 

Figure 20(a) and Figure 20(b) show that as the link 
quality increases, the total number of packets transmitted 
decreases while the reliability increases for all the 
protocols. When the average link quality is 60%, the 
total number of packets transmitted by CF, DCB, and RBP 
is 25323, 53812, and 74132, respectively, while the mean 
reliability is 0.992, 0.78, and 0.89 for CF, DCB, 
and RBP, respectively. The total number of packets 
transmitted in RBP is about 2.9 times that in CF. To 
let DCB achieve higher reliability, we set the maximum 
number of retransmissions at 4 for the forwarding 
nodes, which causes the flat period of DCB in Figure 
20(b) when the link quality increases from 60% to 80%. 
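
The ratios behind the comparison above can be checked with a few lines of Python. The packet counts are the ones reported in the text for 60% average link quality; the dictionary and the helper name `ratio_to_cf` are only illustrative.

```python
# Total packets transmitted at 60% average link quality, as reported in the text.
packets = {"CF": 25323, "DCB": 53812, "RBP": 74132}


def ratio_to_cf(protocol):
    """Return the transmissions of `protocol` as a multiple of CF's transmissions."""
    return packets[protocol] / packets["CF"]


for name in ("DCB", "RBP"):
    print(f"{name}: {ratio_to_cf(name):.2f} times the transmissions of CF")
```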

Figure 20(c) shows that the end-to-end delay 
decreases for all the protocols as the link quality increases. 
This is because the better the link quality, the fewer 
back-offs and retries are needed by all the protocols. Figure 20(d) 
shows that when the link quality increases, the standard 
deviation decreases for RBP and DCB. For RBP, when 
the link quality is 100%, every node still needs to 
retransmit the packet that it receives for the first time, which 
results in a standard deviation of 0. For CF, in contrast, 
the standard deviation increases as the link quality 
increases. This is because as link quality increases, the 
retransmissions become concentrated on fewer nodes. When 
the link quality equals 100%, CF relies on 
the nodes with high connectivity to do the 
retransmission. 


6.4 Impact of the Reliability Threshold 


CF uses α to control the reliability desired by the 
users. Technically, the α value is the threshold that 
each node uses to check whether its neighbors can be 
considered covered. In this section, we evaluate the 
impact of the α value. The total number of nodes in our 
simulation is 150. The α value varies from 0.05 to 0.95 in 
steps of 0.05. Figure 21(a) shows that the reliability curve 
of CF is above the diagonal line, indicating that CF satisfies 
the user requirement well. We note that the closer the 
reliability curve of CF is to the diagonal line, the better CF 
tracks the requirement. Figure 21(a) shows the largest 
difference between desired and actual α values is about 
30.1% when α = 0.15. The average difference is about 
12.9%, which is a satisfactory performance under high 
link dynamics. 
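
The role of the threshold can be illustrated with a minimal Python sketch. It assumes that each node already holds an estimated coverage probability for every neighbor (how CF derives these estimates is described earlier in the paper and is not reproduced here), and it treats a neighbor as covered when the estimate reaches the threshold; the function names and the use of "greater than or equal" are assumptions made for illustration.

```python
def uncovered_neighbors(coverage_prob, alpha):
    """Return the neighbors whose estimated coverage probability is below alpha.

    coverage_prob maps a neighbor id to this node's current estimate that the
    neighbor has received the packet (assumed to be given as input here).
    """
    return sorted(n for n, p in coverage_prob.items() if p < alpha)


def should_rebroadcast(coverage_prob, alpha):
    """A node still has work to do while any neighbor counts as uncovered."""
    return len(uncovered_neighbors(coverage_prob, alpha)) > 0


estimates = {"N13": 0.95, "N15": 0.62}
print(uncovered_neighbors(estimates, alpha=0.9))
print(should_rebroadcast(estimates, alpha=0.5))
```

A larger threshold leaves more neighbors classified as uncovered, which is consistent with the growth in transmissions reported for high threshold values.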

Figure 21(b) shows that as the α value increases, the 
total number of packets transmitted also increases. When 
the α value increases from 0.55 to 0.95, the gap between 
CF and Oracle also increases. For example, when the α 
value is 0.95, the difference in the total number of 
packets transmitted between CF and Oracle is 7280. 

Figure 21(a) and 21(b) hint that setting the α value to 
0.9 is a good choice for reliable flooding. It achieves 
a reliability of 0.992 with 4987 fewer packets transmitted 
than when α is 0.95, indicating that approaching 
100% reliability would be prohibitively expensive in 
unreliable communication environments. 


7 State of the Art 


The literature on flooding protocol design can be 
classified into two categories: deterministic approaches and 
probabilistic approaches. 

In the deterministic approaches, a fixed node within a 
connected dominating set is determined as a forwarding 
node. These approaches are also called fixed-forwarder 
approaches. In these approaches, the connected dominating 
set is calculated using global or local information. 
It has been proved [20] that the creation of a minimum 
connected dominating set (MCDS) is NP-complete, so 
most approaches [12, 18] attempt to find a good 
approximation to the MCDS. Double-Covered Broadcast (DCB) 
[18] provides high reliability when the packet loss ratio is 
low. Compared with CF, DCB introduces more overhead 
to maintain 2-hop neighbor information. Its reliability is 
affected by the link quality between the forwarders. 

In a probabilistic approach, when a node receives a 
packet, it forwards the packet with probability p. The 
value of p is determined by relevant information 
gathered at each node. Simple probabilistic approaches, such 
as [24], predefine a single probability for every node to 
rebroadcast the received packet. When such protocols run 
in a network with varying node densities, the 
nodes in a dense area may receive many redundant 
transmissions. More complicated and efficient protocols, 
such as distance-based and location-based [40] schemes, 
use either area or precise position information to reduce 
the number of redundant transmissions. 
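
The simple probabilistic scheme described above can be sketched in Python as follows. This is an illustration only, not any of the cited protocols: it assumes lossless links, a source that always broadcasts, and a single predefined probability p for all other nodes; the adjacency-list representation and the name `probabilistic_flood` are hypothetical.

```python
import random
from collections import deque


def probabilistic_flood(adjacency, source, p, rng):
    """Flood one packet; return (set of nodes that received it, transmissions).

    The source always broadcasts. Every other node rebroadcasts the first
    copy it receives with probability p. Links are assumed to be lossless.
    """
    received = {source}
    pending = deque([source])  # nodes that have decided to broadcast
    transmissions = 0
    while pending:
        node = pending.popleft()
        transmissions += 1
        for neighbor in adjacency[node]:
            if neighbor not in received:
                received.add(neighbor)
                if rng.random() < p:
                    pending.append(neighbor)
    return received, transmissions


# A four-node chain: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reached, sent = probabilistic_flood(chain, source=0, p=0.7, rng=random.Random(1))
print(sorted(reached), sent)
```

On a sparse topology such as this chain, a single node that declines to rebroadcast cuts off everything behind it, which illustrates why a fixed p suits neither dense nor sparse regions well.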

Although the probabilistic scheme without ACKs 
provides a good stochastic result, it has relatively low 
reliability in unreliable wireless environments. The 
gossip-based approach [13] provides high reliability, 
using multiple rounds of message exchanges. Moreover, 
instead of overhearing, it needs the exchange of 
metadata. For applications such as reprogramming an 
entire sensor network, the code must be propagated to 
every node in the network. Trickle [26] uses 
gossiping and link-layer broadcasting to propagate small 
code updates. RBP [34] is used for applications such 
as routing and resource discovery. In ADB [35], the node 
uses the footer in DATA and ACK frames to make the 
rebroadcast decisions. However, in unreliable wireless 
environments, the loss of an ACK from a receiver will 
cause the sender to treat the receiver as a 100% 
uncovered node, and redundant transmissions are conducted. 
Moreover, such explicit ACKs may cause collisions [11] 
in dense networks. 


As a flooding protocol, CF is different from ExOR [3], 
which is a data forwarding protocol. In ExOR, all the 
nodes work together to forward the data from a source 
to a single destination. The “batch map” used in ExOR 
serves as an explicit ACK for each received packet, 
which differs from collective ACKs, which are 
obtained in an accumulative manner. 

Despite this rich literature, the existing approaches 
do not exploit link correlation for performance 
improvement. Using link correlation, collective ACK becomes 
the key difference between CF and previous work. More 
specifically, the CF protocol has two new contributions: 
(i) instead of using an implicit or explicit ACK [1], each 
node dynamically estimates and accumulates its 
neighbors’ coverage status through collective ACKs by using 
the correlation of packet receptions among neighboring 
nodes; (ii) the forwarders are dynamically selected in a 
distributed fashion based on the nodes’ real-time 
estimations of their neighbors’ packet reception status. These 
two features lead to reliable and efficient message 
dissemination. Finally, we note that the concept of 
collective ACKs is independent of specific protocol designs. 
It could be used as an add-on feature to other routing 
[19, 7, 36, 44] and flooding [9] protocols. We leave this 
as future work. 


8 Conclusions 


In this paper, we propose CF to provide an efficient and 
reliable message dissemination service with low 
complexity. We demonstrate that CF is effective through two 
main mechanisms: collective ACKs and dynamic 
forwarder selection. Both mechanisms take advantage of 
link correlation among neighboring receivers. This is the 
first work that transforms the direct ACKs per receiver 
into a collective one. This unique design noticeably 
reduces the redundancy in rebroadcasting, as shown in our 
evaluation. We fully implemented and evaluated the CF 
protocol in several testbeds including a single-hop 
network with 20 MICAz nodes, a multi-hop network with 
37 MICAz nodes, and a linear outdoor network with 
48 nodes along a 326-meter-long bridge. We also 
performed extensive simulations with various network 
configurations to reveal its performance at scale. The results 
show that the CF protocol has low overhead, low 
dissemination delay, and high reliability in unreliable wireless 
environments. Conceptually, the design of the CF protocol is 
generic enough to be applied to wireless mesh networks 
and other stationary wireless networks. However, it is 
necessary to systematically investigate the implications of 
running CF in these types of networks in the future. 


Abstract: With the advent of new FCC policies on 
spectrum allocation for next generation wireless devices, 
we have a rare opportunity to redesign spectrum access 
protocols to support demanding, latency-sensitive appli- 
cations such as high-def media streaming in home net- 
works. Given their low tolerance for traffic delays and 
disruptions, these applications are ill-suited for tradi- 
tional, contention-based CSMA protocols. 

In this paper, we explore an alternative approach to 
spectrum access that relies on frequency-agile radios to 
perform interference-free transmission across orthogonal 
frequencies. We describe Jello, a MAC overlay where 
devices sense and occupy unused spectrum without cen- 
tral coordination or dedicated radio for control. We 
show that over time, spectrum fragmentation can signif- 
icantly reduce usable spectrum in the system. Jello ad- 
dresses this using two complementary techniques: online 
spectrum defragmentation, where active devices period- 
ically migrate spectrum usage, and non-contiguous ac- 
cess, which allows a single flow to utilize multiple spec- 
trum fragments. Our prototype on an 8-node GNU radio 
testbed shows that Jello significantly reduces spectrum 
fragmentation and provides high utilization while adapt- 
ing to client flows’ changing traffic demands. 


1 Introduction 


The future is bright for next-generation wireless devices. 
While current technologies are limited to operating in 
fixed ranges of increasingly congested spectrum, reforms 
in spectrum management policy promise to free up spec- 
trum in the near future. The Federal Communications 
Commission (FCC) has auctioned recently vacated wire- 
less spectrum to service providers [9]. To further de- 
mocratize the use of this spectrum, online spectrum trad- 
ing services such as SpecEX (www.spectrumbridge.com) 
now allow small service providers to purchase/rent spec- 
trum directly from regional owners. 

Unlike unlicensed bands used by current wireless 
devices, these new spectrum ranges are large and 
uncongested. We can take advantage of the opportunity to 
redesign access mechanisms to support a broader range of 
wireless applications. For example, current wireless 
access mechanisms are designed for best-effort traffic, and 
generally rely on spectrum contention as used in CSMA 
protocols and their variants. The network partitions 
spectrum into fixed channels, and lets each transmission choose a 
channel and contend in time with its peers. While this 
approach works quite well for file transfers and 
interactive applications, past work shows that supporting 
applications with real-time requirements requires additional 
modifications that incur significant overheads [25, 27]. 


In this paper, we reconsider the design of spectrum ac- 
cess mechanisms in dynamic spectrum networks to sup- 
port applications within more restrictive traffic classes. 
Specifically, we consider supporting applications with 
strong quality of service requirements such as high- 
definition multimedia flows in media rich environments 
like the home. Traffic demands for these flows can vary 
significantly over time, but can generally be predicted 
ahead of time. Unlike best-effort traffic applications, 
these multimedia flows require dedicated spectrum ac- 
cess to minimize disruptions to their transmissions and 
to maintain the expected quality of user experience. 


We make two observations that make existing 
contention-based systems unsuitable for these applica- 
tions. First, per-packet contention produces frequent and 
unpredictable transmission disruptions, which would in- 
terfere with our desired traffic delivery constraints. In 
contrast, if multiple transmissions were allocated iso- 
lated frequencies, each flow would obtain necessary ded- 
icated spectrum, while avoiding costly interference that 
traditionally leads to contention and communication de- 
lays [18]. Second, splitting spectrum into fixed channel 
partitions is also unattractive for applications with time- 
varying bandwidth demands. Fixed partitions prevent 
flows from using or releasing available spectrum as nec- 
essary, and would lead to inefficient spectrum usage [8]. 
In this respect, new hardware in the form of 
frequency-agile radios can be extremely useful. With these radios, a 
device examines locally available spectrum before each 
network connection, and directs its radio to operate on 
a frequency range that not only matches its traffic 
demands, but also lies orthogonal to existing transmissions. 
In addition, devices can grab and release spectrum as 
necessary without being confined by fixed partitions. 

Figure 1: Per-session FDMA: Simultaneous media sessions 
work in parallel on isolated frequencies, avoiding costly 
wireless interference while adapting frequency usage to varying 
traffic demands. 

Motivated by these observations, we propose a new 
distributed access technique that lets flows access spec- 
trum in the frequency domain and adapt their spectrum 
usage based on traffic demands (shown in Figure 1). 
We refer to this new access technique as “per-session 
FDMA,” where each session refers to a single contin- 
uous flow, and build a basic framework where traffic 
flows can independently select and adapt their frequency 
usage. First, by detecting “edges” on observed power 
spectrum maps, each device can accurately and quickly 
identify free spectrum in its local area. Second, each de- 
vice can select an available spectrum range based on its 
present traffic demands, using classical algorithms such 
as best fit, worst fit, and first fit [19]. Finally, we pro- 
pose a distributed coordination procedure to synchronize 
sender and receiver pairs in their spectrum usage. 
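
The three classical placement policies named above can be sketched as follows. This is an illustration under assumptions that are not taken from the paper: free spectrum is given as a list of (start, width) blocks in arbitrary frequency units, the flow is placed at the low end of the chosen block, and `select_block` is a hypothetical helper name.

```python
def select_block(free_blocks, demand, policy="best"):
    """Pick a free block that can hold `demand` units of spectrum.

    free_blocks: list of (start, width) tuples describing free spectrum.
    Returns (start, demand) for the chosen placement, or None if no single
    contiguous block is wide enough.
    """
    candidates = [(s, w) for s, w in free_blocks if w >= demand]
    if not candidates:
        return None
    if policy == "first":    # lowest-frequency block that fits
        start, _ = min(candidates, key=lambda b: b[0])
    elif policy == "best":   # tightest block that fits
        start, _ = min(candidates, key=lambda b: (b[1], b[0]))
    elif policy == "worst":  # widest block available
        start, _ = max(candidates, key=lambda b: (b[1], -b[0]))
    else:
        raise ValueError(f"unknown policy: {policy}")
    return (start, demand)


free = [(0, 4), (10, 2), (20, 8)]
for policy in ("first", "best", "worst"):
    print(policy, select_block(free, demand=2, policy=policy))
```

Best fit leaves the wide blocks intact for later high-bandwidth flows, while worst fit leaves the largest leftover piece inside the block it uses.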

Several recent proposals describe systems that adapt 
spectrum usage based on bandwidth demands [15, 20, 
32]. In this context, our work builds an efficient frame- 
work that determines how device pairs sense and coordi- 
nate their access in open spectrum ranges. Our system is 
MAC-agnostic: once devices obtain spectrum using our 
primitives, they can use any MAC. 


Spectrum Fragmentation. Efforts to evaluate our ba- 
sic design reveal another fundamental challenge. Over 
time, as individual transmissions enter and exit the net- 
work or adjust their spectrum usage, available spectrum 
becomes increasingly divided into a collection of discrete 
fragments. This “spectrum fragmentation” means that a 
significant portion of spectrum, while free, is effectively 
unusable because its fragments do not provide the mini- 
mum contiguous spectrum range required by new flows. 
Our experiments show that this artifact does exist in 
practice, and leads to significant performance degradation 
even for networks with very few parallel transmissions. 
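
A small Python sketch makes the effect concrete. The metric below (free spectrum that sits in blocks at least as wide as a new flow's minimum contiguous requirement) is only an illustrative measure chosen here, not the one used in the paper's evaluation, and the block widths are made-up numbers.

```python
def usable_free_spectrum(free_blocks, min_width):
    """Return (total free spectrum, free spectrum usable by a new flow).

    free_blocks: list of (start, width) tuples.
    A block is usable only if it is at least min_width wide, since a new
    flow is assumed to need that much contiguous spectrum.
    """
    total_free = sum(w for _, w in free_blocks)
    usable = sum(w for _, w in free_blocks if w >= min_width)
    return total_free, usable


# Six units are free in total, but half of them sit in fragments too narrow to use.
fragmented = [(0, 1), (3, 1), (6, 1), (9, 3)]
total, usable = usable_free_spectrum(fragmented, min_width=2)
print(f"free: {total}, usable by a flow needing 2 contiguous units: {usable}")
```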

We propose two distinct but complementary 
mechanisms to address this fundamental problem: online 
spectrum defragmentation at the spectrum access layer, 
and non-contiguous frequency access at the physical 
layer. With online spectrum defragmentation, each pair 
of communicating devices voluntarily defragments 
spectrum by moving to alternative frequencies, thereby 
optimizing spectrum availability for other sessions. These 
frequency moves occur periodically in a session or 
as flows adapt to changing spectrum demands. They 
are nearly instantaneous and transparent to 
neighboring flows. Given our emphasis on minimizing 
disruptions, however, this technique cannot completely 
remove spectrum fragmentation. As a complementary 
mechanism, we offer non-contiguous frequency access, 
where a radio can utilize multiple spectrum ranges in 
a single transmission. This provides support for 
high-bandwidth transmissions even in the presence of 
moderate levels of spectrum fragmentation. Our approach 
implements non-contiguous frequency access using a 
“distributed OFDMA” mechanism, which differs from prior 
approaches like SWIFT [24] that rely on CSMA to share 
spectrum among frequency-agile radios. 

These two techniques work best in unison. 
Non-contiguous frequency access requires “frequency guard 
bands” between allocated frequency boundaries to 
eliminate cross-frequency interference, similar to guard bands 
between WiFi channels. Since they are not usable for 
communication, guard bands represent spectrum 
overhead that increases as flows make use of more 
fragmented spectrum ranges. Online spectrum 
defragmentation, on the other hand, effectively suppresses the level 
of fragmentation. 
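
The growth of guard-band overhead with fragmentation can be illustrated with a short Python sketch. The model is an assumption made here for illustration and is not taken from the paper: every fragment used by a flow gives up one guard band of fixed width at each of its two edges, and the widths are arbitrary units.

```python
def usable_bandwidth(fragment_widths, guard_width):
    """Return (usable bandwidth, guard-band overhead) for one flow.

    fragment_widths: widths of the spectrum fragments the flow occupies.
    Assumed model: one guard band of guard_width at each edge of every fragment.
    """
    guard_total = 2 * guard_width * len(fragment_widths)
    return sum(fragment_widths) - guard_total, guard_total


# The same 8 units of spectrum, held as one block or as four fragments.
print(usable_bandwidth([8.0], guard_width=0.5))
print(usable_bandwidth([2.0, 2.0, 2.0, 2.0], guard_width=0.5))
```

Under this model the overhead scales with the number of fragments, which is why reducing fragmentation first makes non-contiguous access cheaper.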


The Jello Overlay. Based on these two comple- 
mentary techniques, we design and implement Jello, a 
MAC overlay to support high-bandwidth real-time ap- 
plications. Jello does not require centralized spectrum 
controllers or dedicated radios for control traffic, making 
it a low-cost and easily deployed solution. Jello radios 
sense, identify and occupy usable frequencies based on 
traffic demands while minimizing spectrum fragmenta- 
tion. Where low levels of fragmentation remain, devices 
accommodate high-bandwidth transmissions using non- 
contiguous frequency access. We deploy a prototype of 
Jello on an 8-node USRP GNU radio testbed, and evaluate 
the benefits of online spectrum defragmentation and 
non-contiguous frequency access, both individually and 
together. Measurements show that Jello reduces disrup- 
tions to applications by as much as a factor of 8. 

Our work makes three key contributions. First, we 
explore spectrum access techniques for real-time wireless 
applications with low tolerance for traffic disruptions, 
and propose mechanisms for frequency-agile radios 
to sense, occupy, and synchronize spectrum usage. 
Second, we identify the spectrum fragmentation chal- 
lenge, and propose two complementary solutions to max- 
imize spectrum utilization. Finally, we implement and 
deploy a prototype of Jello, a complete MAC overlay en- 
compassing our techniques. We evaluate the effective- 
ness of Jello mechanisms using both detailed measure- 
ments of an 8-node GNU-radio testbed and simulated 
experiments. Jello provides interference-free access to 
demanding applications while maximizing utilization of 
available radio spectrum, and can be deployed on hard- 
ware available today. 


2 A Case for Per-session FDMA 


The expected arrival of new wireless spectrum is an op- 
portunity to redesign spectrum access protocols to sup- 
port a richer set of network applications. In particu- 
lar, available spectrum can be used to support “soft real- 
time” applications, i.e. applications such as multimedia 
streaming that have very low tolerance for data loss, de- 
lays and jitter. 

Given their strong demands on the underlying 
wireless network, these applications do not perform well on 
CSMA protocols that require parallel flows to perform 
per-packet contention. Recent experimental results show 
that such contention leads to unpredictable network 
delays and disruptions [25, 27], ultimately resulting in 
visible disruptions to the application-level user experience. 
Quality of Service extensions such as IEEE 802.11e can 
prioritize traffic, but do not prevent contention 
between multiple flows in the same traffic class, e.g. video 
streams in neighboring houses. An alternative for 
predictable traffic delivery is to employ Time Division 
Multiplexing (TDM) to obtain a collision-free transmission 
schedule. However, this requires fine-grained 
network-wide time synchronization and scheduling, which are 
difficult to implement in practice. 


Assumptions. Our focus is on supporting demand- 
ing wireless media applications. We assume that these 
applications operate in a dedicated spectrum band, gen- 
erate continuous traffic with time-varying load, and have 
strong quality of service requirements. In environments 
where they must co-exist with legacy systems using best- 
effort traffic, we envision that local wireless spectrum 
can be partitioned into two ranges for isolation. One 
range is dedicated to legacy applications using 802.11 
CSMA, and the other is dedicated for media-streaming 
applications running our proposed protocols. 


Frequency-agile Radios. Recent hardware advances 
have produced “frequency-agile radios,” wireless radios 
capable of operating across a wide range of frequencies 
and jumping between them in milliseconds. Currently 
available hardware includes the WARP [30], USRP [21], 
AirBlue [16] and SORA [28], with more expected in the 
next few years. With these radios, we can now consider 
per-session FDMA, or Frequency Division Multiplexing 
Access. In this approach, parallel sessions occupy 
orthogonal spectrum ranges, thus completely avoiding 
cross-flow interference. When a media session starts, the 
two end-devices involved choose a free frequency block 
to set up packet transmissions. As shown in Figure 1, 
flows can adapt their frequency usage over time as their 
bandwidth demands vary, thus using time multiplexing to 
make the best use of radio spectrum. Recent work [15] 
shows that adapting spectrum on demand leads to a 75% 
improvement over 802.11b. 

Our approach differs from the concept of adapting 
frequency bandwidth on conventional 802.11 devices [8], 
where 802.11 channels can change their width to 40, 
20, 10 or 5 MHz by adjusting clock cycles. Our 
experiments show that scaling up traffic to fixed channel widths 
can reduce utilization by up to 30% in our application 
scenarios. In comparison, per-session FDMA operates 
across wider spectrum ranges at fine granularities to 
ensure high utilization, and completely eliminates CSMA 
traffic contention. Furthermore, each link now can flexibly 
combine multiple spectrum ranges to form a 
high-bandwidth transmission. The proposed per-session FDMA 
can work on any of the current frequency-agile radio 
designs [2, 16, 21, 24, 28, 30]. Since our approach operates 
directly on frequency bands, and uses frequency 
selection to avoid access conflicts, we also differ from prior 
work [15] that uses pseudo-random spreading codes to 
implement random spectrum access. 


Challenges. A practical per-session FDMA system 
for wide-spread deployment needs to support soft real- 
time applications without relying on centralized spec- 
trum controllers or costly dedicated radios for control 
traffic. Such a system must address several key chal- 
lenges. First, to avoid disrupting ongoing transmissions, 
devices must be able to accurately and quickly iden- 
tify free frequencies. Second, each transmission pair 
needs to select a free spectrum block based on their traf- 
fic demand while minimizing spectrum fragmentation. 
They also must do so without disrupting other ongoing 
transmissions, and without the help of any control 
radio. Similarly, when a transmission pair needs to change 
frequency usage to accommodate variations in traffic 
demand (those that cannot be handled by MAC rate adaptation), 
they also need to make the process transparent to others. 


3 Jello Framework 


To address these challenges, we propose Jello, a 
lightweight MAC overlay system that realizes distributed 
per-session FDMA. Jello radios sense, identify and occupy 
usable frequencies to support time-varying traffic 
demands and to avoid interfering with each other. Each 
Jello device has a single half-duplex frequency-agile 
radio for wireless communication, and does not require any 
central control or dedicated control radio. 


3.1 Identifying Usable Spectrum 


When accessing spectrum, Jello devices must avoid con- 
flicting with other ongoing sessions. Jello achieves this 
by performing spectrum sensing to quickly and accu- 
rately identify usable spectrum ranges. Unlike the time- 
domain sensing approach [4], Jello uses a frequency- 
domain mechanism, benefiting from its radio hardware’s 
frequency-agility. Unlike WiFi devices that sequentially 
scan channels, a frequency-agile radio can listen to the 
entire spectrum span, as demonstrated by several avail- 
able radio platforms [24, 30]. Using the frequency- 
domain signal, each radio constructs a power spectral 
density (PSD) map [13] that measures the energy level 
on each small frequency range. 

To identify usable frequency blocks, conventional 
approaches perform energy detection on the PSD map [10]. 
For a given threshold $T_{energy}$, each radio treats 
frequency ranges with energy higher than $T_{energy}$ as busy 
and the rest as unoccupied. The detection accuracy, 
however, is shown to be highly sensitive to the choice of 
$T_{energy}$, and finding a uniformly optimal $T_{energy}$ is 
unrealistic [24]. Recent work proposes to cross-validate the 
detection result by “poking” transmissions on “busy” 
frequency ranges and observing their reactions [24]. Each 
poking event disrupts existing transmissions, forcing 
them to move to other frequencies or change their 
transmission parameters. Thus while this solution works for 
transmissions that are highly resilient to frequent 
disruptions, it would cause serious performance issues for the 
media sessions our system targets. 
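
The threshold sensitivity of plain energy detection can be seen in a minimal Python sketch. The PSD values (in dB) and the two thresholds are made-up numbers chosen for illustration, and `energy_detect` is a hypothetical helper name.

```python
def energy_detect(psd_db, threshold_db):
    """Mark a frequency bin as busy (True) when its power exceeds the threshold."""
    return [p > threshold_db for p in psd_db]


# Two transmissions: a strong one in bins 2-4 and a weaker one in bins 7-8.
psd_db = [-90, -90, -60, -61, -59, -90, -90, -75, -74, -90]

strict = energy_detect(psd_db, threshold_db=-70)  # misses the weaker transmission
loose = energy_detect(psd_db, threshold_db=-80)   # detects both transmissions
print(strict.count(True), loose.count(True))
```

A threshold that suits one received power level misclassifies another, so no single value is right for every transmitter and distance.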


Sensing via Edge Detection. We exploit a unique 
property of radio transmissions in the frequency domain 
for accurate detection. To avoid interference to other 
transmissions, OFDM based transmitters use filters to 
limit the radio energy within certain frequency bands. As 
a result, the PSD profile of each transmission has clear 
edges on the frequency band boundaries, regardless of 
energy levels (shown in Figure 2). We can reliably iden- 
tify usable frequency blocks by identifying these edges. 
Our edge detection mechanism works as follows. 
First, as a pre-processing step, we smooth the PSD map 
by averaging it over multiple consecutive observations 
and applying two coarse power thresholds to filter out 
obvious frequency ranges. Frequency ranges with very 
high power are treated as busy and those with very low 
power as unoccupied. This pre-processing aims to filter out 
most of the noise in the PSD map before trying to locate 
edges. This technique has been sufficient in our experiments 
without using sophisticated smoothing algorithms like [5]. 

Figure 2: A sample PSD map and its first-order derivative. 
Jello identifies occupied frequency blocks using edge detection. 
While the absolute signal strength varies significantly across 
the frequency, the rising/falling edges are easier to detect. 

Second, we apply search-based edge detection [14] 
and measure the edge strength by the first-order
derivative of the PSD map. Let P(k) represent the energy value
of a spectrum section with index k, and let P'(k)
represent its first-order derivative. To decide whether edges
are present, we choose a detection threshold D_edge. If
P'(k) > D_edge then k has a rising edge, and if P'(k) <
-D_edge then k has a falling edge. A frequency block
with a rising edge to its left and a falling edge to its right 
is declared as busy and the rest as free. 
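
The rule above can be illustrated with a minimal sketch. It assumes the derivative P'(k) is approximated by the backward difference P(k) - P(k-1) and that the PSD map is a plain list of dB values; the function names, the sample values, and the -65 dB energy threshold are hypothetical and are not taken from the Jello implementation.

```python
def energy_detect(psd, d_energy):
    """Baseline energy detector: indices whose energy exceeds one threshold."""
    return [k for k, p in enumerate(psd) if p > d_energy]


def edge_detect(psd, d_edge):
    """Return busy blocks as (first_index, last_index) pairs.

    P'(k) is approximated here by the backward difference P(k) - P(k-1).
    """
    blocks, start = [], None
    for k in range(1, len(psd)):
        slope = psd[k] - psd[k - 1]
        if slope > d_edge and start is None:
            start = k                      # rising edge: a block begins at k
        elif slope < -d_edge and start is not None:
            blocks.append((start, k - 1))  # falling edge: block ended at k-1
            start = None
    return blocks


# Hypothetical PSD map in dB holding one strong and one weak transmission.
psd = [-90, -90, -60, -61, -59, -90, -90, -70, -70, -90]
print(edge_detect(psd, d_edge=5))        # [(2, 4), (7, 8)]
print(energy_detect(psd, d_energy=-65))  # [2, 3, 4]
```

In this toy input the single energy threshold misses the weaker transmission, while the edge rule finds both blocks because it depends only on the jump at the block boundaries.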

Compared to the energy detector, the edge-detection 
based sensing is less dependent on the choice of detec- 
tion threshold. As shown in Figure 2, while the absolute 
signal strength varies significantly over the frequency, 
the rising/falling edges are easy to detect. This design 
works well in OFDM-based systems where the PSD map 
can capture frequency usage accurately and on-demand. 
While other forms of interference such as wireless
microphones might not display similar edges in the
PSD map, we can incorporate other mechanisms such
as feature detection based sensing for improved accu- 
racy [4,11]. In our target scenario, we focus on a homo- 
geneous setting with radios all using OFDMA and within 
a short distance, thus our proposed sensing mechanism 
works well. 


Calibrating Sender/Receiver Sensing Results. Each 
sender/receiver pair must synchronize their sensing re- 
sults to identify mutually available frequency ranges. 
While the sender must pause its transmission to sense 
spectrum, the receiver senses while receiving at no extra 
cost. Therefore, the receiver constantly monitors
spectrum usage and locates occupied frequency ranges based
on its maximum tolerable noise and interference level. 
When new spectrum blocks become available, it piggy- 
backs the information via data or control packets to sig- 
nal the sender to sense. 


3.2 Choosing Frequency Blocks


After identifying mutually available frequency ranges, a 
sender/receiver pair needs to choose a frequency block 
to occupy. Such decisions usually occur when sessions 
start. It can also happen during a session when traffic 
changes cannot be handled by MAC rate adaptation. The 
device pair determines the amount of frequency needed 
based on estimated traffic demands and estimated MAC 
transmission rates on available frequency ranges. They 
can expand/shrink the current frequency usage, or move 
to a different frequency block. The ultimate goal is to 
obtain desired spectrum while maximizing system-wide 
usage efficiency. 

The frequency selection problem is analogous to the 
online task scheduling problem [19]. Due to the unpre- 
dictable dynamics of spectrum demands, optimal solu- 
tions are hard to find. Similar problems have also been 
studied extensively in the context of CPU, memory and 
storage allocations. The most efficient known solutions 
apply heuristics-based algorithms [19], which have been 
shown to perform very well in most cases. In particular, 
we consider the well-known best fit strategy that selects 
the smallest available frequency block that can accept 
the current spectrum request, the worst fit strategy that 
uses the largest available block, and the first fit strategy 
that uses the first large enough block. When no block 
is large enough to satisfy a session’s demand, we choose 
the largest block to accept the session partially. We found 
in our experiments that best fit outperforms others. 
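
As an illustration of the three strategies, the sketch below represents each free block as a hypothetical (start, width) tuple and applies the fallback described above (take the largest block when none is large enough). The function name and the sample block list are assumptions made for this example.

```python
def choose_block(free_blocks, demand, strategy="best"):
    """Pick one free (start, width) block for a request of `demand` units."""
    if not free_blocks:
        return None
    fitting = [b for b in free_blocks if b[1] >= demand]
    if not fitting:
        # No block can hold the request: accept it partially on the largest.
        return max(free_blocks, key=lambda b: b[1])
    if strategy == "best":     # smallest block that still fits
        return min(fitting, key=lambda b: b[1])
    if strategy == "worst":    # largest available block
        return max(fitting, key=lambda b: b[1])
    if strategy == "first":    # lowest-frequency block that fits
        return min(fitting, key=lambda b: b[0])
    raise ValueError(f"unknown strategy: {strategy}")


free = [(0, 40), (60, 30), (120, 80)]
print(choose_block(free, 28, "best"))   # (60, 30)
print(choose_block(free, 28, "worst"))  # (120, 80)
print(choose_block(free, 28, "first"))  # (0, 40)
print(choose_block(free, 100))          # (120, 80), partial fit
```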


Propagation-aware frequency selection. In some 
cases, radio propagation conditions differ significantly 
across frequency ranges, e.g., due to channel fading.
Information on received signal and interference strength, if
available, can be integrated into Jello’s frequency selec- 
tion algorithm to select high-quality blocks that provide 
better reliability and higher bandwidth [4,23]. In the cur- 
rent Jello prototype, the propagation quality is flat across 
the frequency span considered, thus the receivers use the 
measured interference strength in their selection process. 


4 Suppressing Spectrum Fragmentation 


Efforts to evaluate our basic Jello design reveal another
fundamental challenge. Over time, as individual
transmissions enter and exit the network or adjust their
spectrum usage, available spectrum becomes increasingly
divided into a collection of fragments (Figure 3(a)). This
is because each radio must access spectrum contiguously,
i.e. using a single frequency block. In this case, although
a significant portion of spectrum remains unoccupied, it
is effectively unusable because no individual fragment is
large enough for a new request. A similar fragmentation
problem appears in disk and memory allocation [19]. In
this section, we examine the severity and impact of
spectrum fragmentation, and propose two distinct but
complementary techniques to minimize it. We provide
high-level descriptions of our proposed techniques and defer
the detailed implementation issues to Section 5.

Figure 4: The impact of spectrum fragmentation with 4
streaming media sessions, using VBR video traces from the
ASU trace database [3]. We compare the basic Jello system
(with fragmentation) to an oracle system that eliminates all
fragments.


4.1 Impact of Spectrum Fragmentation 


To understand the severity and impact of spectrum frag- 
mentation, we perform a detailed simulation using video 
traces from an online database [3]. Using a number of 
frame traces of H.263 video sessions, we simulate a sce- 
nario of multiple media sessions within close proximity. 
We measure the impact of spectrum fragmentation by 
application disruption rate, defined as the percentage of 
time a session cannot obtain enough spectrum to support 
X% of its present traffic demand.
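
The metric can be made concrete with a short sketch. It assumes demand and granted spectrum are sampled at equal time steps and expressed in the same units; the function name and the sample values are hypothetical.

```python
def disruption_rate(demand, granted, x_percent=90):
    """Fraction of time steps where granted spectrum < X% of the demand."""
    if len(demand) != len(granted) or not demand:
        raise ValueError("need two equal-length, non-empty series")
    disrupted = sum(
        1 for d, g in zip(demand, granted) if g * 100 < x_percent * d
    )
    return disrupted / len(demand)


# Four time steps with a constant demand of 10 units.
print(disruption_rate([10, 10, 10, 10], [10, 9, 8, 5]))  # 0.5
```

Integer arithmetic is used for the comparison so that a grant of exactly X% of the demand is not counted as a disruption because of rounding.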

We compare two possible frequency access systems: 
(1) an oracle system that rearranges sessions’ frequency 
usage to defragment the spectrum completely; and (2) a 
basic Jello system where sessions “claim” their needed 
spectrum when they start, and do not change frequencies 
unless their spectrum demands change. 

Figure 4 plots the application disruption rate for 4
variable bit rate (VBR) video sessions for X = 90%. On the
X-axis, we show the ratio of the total average traffic load
of all 4 videos to the spectrum capacity. Clearly, the
oracle system that fully defragments the spectrum performs
significantly better when the 4 videos account for a
significant portion of all available spectrum. To guarantee that
the disruption rate never rises above 3%, the basic
system can only support traffic equal to 67% of the total
spectrum capacity, while the oracle system can support
traffic up to 83% of the spectrum capacity*. This is a
significant boost in allowed traffic volume, and underlines
the significant impact that fragmentation has on system
performance.

Figure 3: Spectrum fragmentation and ways to mitigate its impact. (a) When sessions share spectrum by accessing contiguous
frequency, they can create spectrum fragments. (b) After S3’s self defragmentation, the same spectrum can now support more
spectrum requests. (c) Session S4 uses two spectrum fragments for a single transmission.


4.2 Online Spectrum Defragmentation 


The above results motivate us to improve Jello’s basic 
design to suppress spectrum fragmentation. The first and 
most direct solution is to perform online defragmentation.
A naive strawman version is to periodically perform
global defragmentation, where all sessions pause
their transmissions and rearrange their frequency usage so
that unoccupied frequency blocks are merged into a large
contiguous range. This, however, is infeasible in our
problem context because we must minimize disruptions 
to ongoing traffic flows. There is also no central con- 
troller to perform global defragmentation. 

Instead, we propose an online, distributed approach 
to defragmentation: ongoing transmissions periodically 
consider moving to an alternative spectrum block using 
the best-fit algorithm to optimize overall spectrum avail- 
ability. Each sender/receiver pair periodically senses lo- 
cal spectrum usage, and if possible, coordinates to switch 
to a frequency block that better optimizes the overall 
spectrum availability. For example, Figure 3(b) follows
our earlier scenario where a session S2 terminates and
leaves a spectrum fragment. If S3 voluntarily moves
to spectrum block 2-3, the new request S4 can be
fulfilled and the overall spectrum utilization increases.
Finally, using spectrum sensing to identify unoccupied
frequency ranges, each device pair independently
defragments spectrum without coordinating with other pairs.
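
One possible reading of this per-pair decision is sketched below. It assumes spectrum is measured in integer units, that a pair moves only when the move enlarges the largest contiguous free range it can sense, and that a moved session is placed at the low end of the best-fit block. These rules, the helper names, and the example layout are assumptions for illustration, not the paper's exact procedure.

```python
def free_runs(total, occupied):
    """Free (start, width) runs in [0, total) given occupied (start, width)."""
    runs, cursor = [], 0
    for start, width in sorted(occupied):
        if start > cursor:
            runs.append((cursor, start - cursor))
        cursor = max(cursor, start + width)
    if cursor < total:
        runs.append((cursor, total - cursor))
    return runs


def largest_free(total, occupied):
    return max((w for _, w in free_runs(total, occupied)), default=0)


def defrag_move(total, own, others):
    """Return a new start for `own`, or None if moving does not help."""
    width = own[1]
    candidates = [r for r in free_runs(total, others + [own]) if r[1] >= width]
    if not candidates:
        return None
    target = min(candidates, key=lambda r: r[1])   # best fit
    before = largest_free(total, others + [own])
    after = largest_free(total, others + [(target[0], width)])
    return target[0] if after > before else None


# S1 occupies units 0-1, S2 has left units 2-3 free, S3 occupies units 4-5.
print(defrag_move(8, own=(4, 2), others=[(0, 2)]))  # 2
print(defrag_move(8, own=(2, 2), others=[(0, 2)]))  # None
```

In the first call the move merges two 2-unit fragments into one 4-unit range; in the second the session is already packed, so it stays put.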


Cost. The cost of our online defragmentation includes
(1) the sensing and coordination overhead spent by
device pairs to identify unoccupied spectrum and rearrange
their frequency usage, and (2) possible conflicts when
two sessions simultaneously defragment and make
conflicting frequency adjustments.

*The oracle cannot support 100% traffic load because the flows are
VBR and the peak load occasionally exceeds the spectrum capacity.

Focusing on minimizing disruptions to ongoing ses- 
sions, Jello uses the following mechanisms to minimize 
defragmentation cost: 


• Minimizing Sensing/Coordination Overhead: To
minimize sensing overhead, Jello receivers constantly
monitor spectrum to identify possible opportunities for
defragmentation. They signal their senders to perform
sensing only after identifying possible opportunities
themselves. To minimize coordination delay, each
sender/receiver pair uses its present frequency block
to exchange handshakes and schedule frequency
adjustments. At initialization or in the unlikely event of
lost synchronization or link failure, Jello devices enter
a SYNC state to recover and resume communications.


• Avoiding Defragmentation Conflicts: Multiple
devices can simultaneously detect a defragmentation
opportunity and make conflicting frequency adjustments.
Jello minimizes such conflicts by randomizing
defragmentation efforts to avoid simultaneous adjustments.


4.3 Non-contiguous Frequency Access


Our second solution is to enable radios to combine 
multiple spectrum pieces to form a single transmission. 
As shown in Figure 3(c), S4 now combines frequency blocks
2 and 4 in a single transmission, as if it used a
single frequency block.

Non-contiguous frequency access is now widely 
used in centralized wireless networks such as WiMAX 
and cellular LTE systems. It is implemented in the 
form of Orthogonal Frequency-Division Multiple Ac- 
cess (OFDMA). Existing designs, however, require 
global synchronization, and fail when applied to dis- 
tributed networks without global synchronization. A key 
contribution of Jello is to identify and address the chal- 
lenges of implementing distributed OFDMA and to pro- 
totype our design on USRP GNU radios. 


Cost. To minimize interference, frequency guard bands 
must be placed at link boundaries [17]. Guard bands are
not usable for transmissions, and are essentially
spectrum overhead. Frequency guard bands are not an
artifact of non-contiguous frequency access: they are
required for contiguous access including 802.11 channels
(which use 16% of frequency bandwidth as guard bands). 
On the other hand, the amount of guard bands increases 
when links start to use non-contiguous frequency blocks. 


4.4 The Case for a Unified Approach 


With the above two solutions, we ask the question: “Is
one solution sufficient to address spectrum
fragmentation?”

First, consider a scenario where only online spectrum 
defragmentation is available. While this technique im- 
proves spectrum utilization overall, each sender-receiver 
pair is acting independently, and cannot disturb other on- 
going transmissions. Therefore, only a limited level of 
spectrum defragmentation is possible, and this technique 
cannot achieve the same effectiveness as a global, syn- 
chronized defragmentation approach. Thus a low level 
of fragmentation might remain. 

Next consider a network using noncontiguous fre- 
quency access, but no online defragmentation. While this 
technique allows devices to utilize spectrum fragments 
as if they were a single contiguous fragment, it comes 
at the cost of multiple guard bands between link
boundaries. Without online defragmentation, spectrum
fragmentation will continue to worsen over time. Spectrum
lost to guard bands will continue to increase, lowering 
overall spectrum utilization. 

Clearly, neither technique by itself can fully address 
the challenge of spectrum fragmentation. Together they 
form a more complete solution. Online defragmentation 
limits spectrum fragmentation to a low level, and non- 
contiguous access makes all of the spectrum available 
without incurring significant overhead to guard bands. 


5 Implementing Jello 


We have implemented Jello on USRP GNU Radios. 
Despite having limited frequency bandwidth and large 
processing delays [12], USRP radios are widely avail- 
able and fully reconfigurable across various protocol lay- 
ers. We use the USRP implementation as a “proof- 
of-concept” evaluation of Jello. We modified GNU ra- 
dio software to implement spectrum sensing, distributed 
contiguous and noncontiguous frequency access, online 
defragmentation, and sender/receiver coordination. 
Figure 5 presents a high-level structure of Jello. At 
the physical layer, each Jello device operates on non- 
contiguous frequency ranges using distributed OFDMA. 
At the MAC layer, Jello devices sense spectrum to
identify usable frequency, and adapt their frequency usage
when the application demand changes or when an
opportunity to defragment appears. We now describe our
implementation in detail.

Figure 5: Jello system architecture.


5.1 Physical Layer 


At the physical layer, Jello’s key contribution is to imple- 
ment spectrum sensing and distributed frequency access, 
both contiguous and non-contiguous, on today’s com- 
mon off-the-shelf hardware. Jello implements frequency 
access using OFDMA, which partitions the spectrum 
span into many small subcarriers. OFDMA has been 
widely used in centralized systems such as WiMAX, 
which divides a 20MHz frequency range into 2048
subcarriers of 10kHz each. Each sender can transmit on
any subset of the subcarriers, either contiguously or
non-contiguously aligned in frequency. Each receiver can
listen to the entire set of subcarriers at once. Simultaneous
transmissions can occur at different subcarriers without 
interfering with each other. 

Implementing OFDMA on distributed networks, how- 
ever, is hard. Existing designs in centralized networks 
rely on global synchronization to maintain subcarrier 
orthogonality, so that transmissions on isolated subcar- 
riers do not interfere with each other. In distributed 
networks, where global synchronization is infeasible, 
OFDMA transmissions fail. To understand the causes, 
we perform an experiment by configuring 4 links on dif- 
ferent frequency subcarriers. Our results show that sig- 
nificant link failures occur. The failures are not caused 
by the inherent propagation impairments, but by the fol- 
lowing two reasons: 


(1) Unable to detect packet preamble: In many cases, 
the receivers cannot detect any preamble that marks the 
beginning of a packet. This is because OFDMA de- 
tects preambles using a time-domain “delayed correla- 
tion” property from a signal placed at the head of each 
packet [29]. Because preambles from multiple transmis- 
sions are no longer synchronized, the delayed correlation 
property no longer holds in time-domain signals,
preventing any successful preamble detection.

Figure 6: An example of Jello’s flexible distributed spectrum access, implemented on USRP GNU radios.
Transmissions access and share radio spectrum in the frequency domain. Among them, link 3 operates on two
non-contiguous spectrum blocks to form a single transmission.


(2) Unable to decode data packet: Even after fixing the
preamble detection, significant losses still occur during
packet decoding. This is because while multiple
transmissions operate on different subcarriers, they leak
energy to adjacent subcarriers, creating inter-carrier
interference and destroying the subcarrier orthogonality at
receivers. Compared to the preamble, packet data is much
more vulnerable to interference because it is sent with fewer
error protections.


This motivates us to design receivers that “filter” out or 
minimize unwanted signals to restore the desired trans- 
mission properties. With this concept in mind, we pro- 
pose two new mechanisms on top of the conventional 
OFDMA design to restore successful transmissions in 
distributed networks. 


Restoring Preamble Detection. To restore the delayed
correlation property required for preamble detection,
we apply an adaptive filter at receivers to remove signals 
from unwanted subcarriers. To support non-contiguous 
frequency access, we use a multi-band filter bank. Given 
the knowledge of the subcarriers used by its transmitter, 
the receiver first applies a low-pass filter to eliminate sig- 
nals outside of its lowest and highest indexed subcarriers, 
and then uses multiple band-stop filters to remove signals 
from other unwanted subcarriers within the range. This 
design allows receivers to adapt filter ranges on-the-fly. 
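
The planning step of such a filter bank can be sketched as follows. The sketch only derives which subcarrier ranges to keep and which to suppress from the set of subcarriers in use; it does not design the filters themselves. The function name and the example indices are hypothetical.

```python
def filter_plan(used_subcarriers):
    """Derive the outer pass range and the inner band-stop ranges.

    Returns ((lowest, highest), [(stop_first, stop_last), ...]) where the
    stop ranges are the unused subcarriers between the link's own blocks.
    """
    members = set(used_subcarriers)
    if not members:
        raise ValueError("the link must use at least one subcarrier")
    lo, hi = min(members), max(members)
    stop_bands, start = [], None
    for k in range(lo, hi + 1):
        if k not in members and start is None:
            start = k                          # a gap between blocks opens
        elif k in members and start is not None:
            stop_bands.append((start, k - 1))  # the gap closes before k
            start = None
    return (lo, hi), stop_bands


# A link using two non-contiguous blocks: subcarriers 10-19 and 40-59.
used = list(range(10, 20)) + list(range(40, 60))
print(filter_plan(used))  # ((10, 59), [(20, 39)])
```

Because the plan is recomputed from the current subcarrier set, a receiver could rebuild it whenever its sender changes blocks.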


Without global synchronization, devices also
experience frequency offset [13], defined as the frequency skew
between devices’ central carrier frequencies. The
presence of frequency offset could lead to errors in signal
filtering. To suppress its impact between sender/receiver 
pairs, Jello receivers dynamically adjust their carrier fre- 
quency and filter width based on the result of preamble 
detection. At initialization, the receiver starts from a loose
filter and gradually shrinks the filter to suppress interference. If
the filter becomes too tight and fails to detect any
preamble in a period, the receiver expands the filter to capture
more subcarriers. After each successful preamble
decoding, it estimates the frequency offset from its sender and
refines the filter parameters.


Restoring Reliable Packet Receptions. While the 
use of receiver filters significantly improves preamble 
detection, packet losses can still occur due to
out-of-band emissions among transmissions [17]. The same work
also shows that placing frequency guard bands between
transmission boundaries is the most effective solution.
To minimize the guard band overhead, Jello devices directly
measure interference power levels from the PSD map, and
avoid using severely affected frequencies. This tech- 
nique, combined with the adaptive filtering, allows Jello 
devices to correctly determine and minimize the usage of 
guard bands. 


GNU Radio Implementation. We implement Jello’s 
distributed OFDMA at 2.38GHz on a spectrum band of 
500kHz. We use 256 subcarriers (or frequency sections), 
each of size 1.953kHz. To carry adequate signals for 
reliable preamble detection, each transmission must use 
at least 28 subcarriers, which can be non-contiguously 
aligned. We implement the receiver filter using the
Hamming window approach [22]. To compensate for the
frequency offset between sender and receiver, we initially
extend the filter by 5 subcarriers and then adjust its
central frequency and width on-the-fly. We found in our
experiments that adding the receiver filter helps to reduce
the amount of guard bands. Overall, placing 2 subcarri- 
ers at each link boundary is sufficient to protect all the 
links in our experiments. 

Figure 6 illustrates an example PSD map of a system 
with three links. In this example, both links 1 and 2
occupy a contiguous block while link 3 utilizes two blocks
simultaneously to build a high bandwidth transmission. 
Small guard bands were placed at link frequency bound- 
aries to minimize cross-link interference. 

We implement the spectrum sensing directly over 
OFDMA. Each device performs the Fast Fourier
Transform (FFT) on collected frequency signals, and averages
the results over 50 OFDM symbols† to produce a PSD
map. It computes the first-order derivative and uses a
threshold of D_edge = 5dB to locate edges. We chose these
parameters because they work well in our experiments.
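
The numbers quoted in this section can be cross-checked with a few lines of arithmetic. The constants come from the text (500kHz band, 256 subcarriers, 28-subcarrier minimum, 50-symbol averaging, symbol durations of 4µs and 2ms, and a 10kHz frequency offset); the variable and function names are chosen for this example only.

```python
BAND_HZ = 500e3        # spectrum band used by the prototype
N_SUBCARRIERS = 256    # subcarriers (frequency sections)
MIN_SUBCARRIERS = 28   # minimum per transmission for preamble detection

subcarrier_hz = BAND_HZ / N_SUBCARRIERS
min_link_hz = MIN_SUBCARRIERS * subcarrier_hz


def sensing_time(symbol_seconds, n_symbols=50):
    """Time needed to average the PSD map over n_symbols OFDM symbols."""
    return n_symbols * symbol_seconds


print(f"subcarrier width: {subcarrier_hz / 1e3:.3f} kHz")          # 1.953 kHz
print(f"smallest link:    {min_link_hz / 1e3:.1f} kHz")            # 54.7 kHz
print(f"802.11 a/g:       {sensing_time(4e-6) * 1e3:.1f} ms")      # 0.2 ms
print(f"GNU radio:        {sensing_time(2e-3) * 1e3:.1f} ms")      # 100.0 ms
print(f"10 kHz offset:    {10e3 / subcarrier_hz:.2f} subcarriers") # 5.12
```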


5.2 Access Layer 


At the access layer, each Jello device will select fre- 
quency blocks to set up its communication session. Dur- 
ing the session, it adapts its frequency usage when its 
traffic demand changes or when an opportunity for de- 
fragmentation appears. Without any dedicated radio for 
control, Jello addresses the following challenges: (1) 
each sender/receiver pair needs to synchronize on their 
frequency usage to ensure reliable transmissions; (2) 
to avoid the hidden terminal problem, each sender/receiver
pair needs to coordinate and choose proper frequency 
block(s) that are available to both of them; (3) simul- 
taneous transmissions need to avoid using overlapping 
frequency blocks; and finally (4) devices must be able to 
quickly recover from failures caused by channel impair- 
ments and external interference. 


Synchronizing Sender/Receiver. Each Jello sender 
and receiver pair performs handshaking to synchronize 
the frequency blocks they use for data transmission. 
This coordination has low overhead and does not involve 
any contention among sessions. Because GNU radios 
have large processing delays [12], our current Jello im- 
plementation does not include per-packet acknowledge- 
ments. The handshaking process is always initiated by 
the sender. 

To change a session’s spectrum usage, the sender per- 
forms spectrum sensing to see if there is any opportunity 
for change. If so, it sends a request (REQ) to its receiver 
indicating its spectrum sensing results. After receiving 
a REQ, the receiver selects a proper set of blocks and 
replies with an acknowledgement (ACK) indicating the 
selection. It also starts to decode signals from the new 
blocks. Upon receiving an ACK, the sender configures 
its transmissions on the new blocks. ACK failures could 
lead to a discrepancy between the sender’s and receiver’s
frequency usage. Thus, after failing to decode packets for
a period of T_porr, the receiver “switches” back to
decoding on the original blocks.
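
The receiver side of this exchange, including the fallback to the original blocks, can be sketched as a small state holder. The sketch assumes a caller-supplied clock and block-selection function; the class, method, and parameter names are hypothetical, and the 1s default mirrors the fallback timeout used in the prototype.

```python
class HandshakeReceiver:
    """Receiver side of the REQ/ACK exchange with fallback on silence."""

    def __init__(self, blocks, fallback_timeout=1.0):
        self.blocks = blocks          # blocks currently being decoded
        self.previous = None          # blocks to return to if the switch fails
        self.last_decode = 0.0
        self.fallback_timeout = fallback_timeout

    def on_req(self, sensed_free, select, now):
        """Handle a REQ: pick new blocks, start decoding them, reply ACK."""
        self.previous = self.blocks
        self.blocks = select(sensed_free)
        self.last_decode = now
        return ("ACK", self.blocks)

    def on_packet(self, now):
        """A packet decoded on the current blocks confirms the switch."""
        self.last_decode = now
        self.previous = None

    def tick(self, now):
        """Fall back if nothing was decoded on the new blocks in time."""
        if self.previous is not None and now - self.last_decode >= self.fallback_timeout:
            self.blocks, self.previous = self.previous, None


rx = HandshakeReceiver(blocks=[(0, 28)])
print(rx.on_req([(60, 30)], select=lambda free: [free[0]], now=0.0))
rx.tick(now=1.0)   # the ACK was lost and no data arrived on the new blocks
print(rx.blocks)   # [(0, 28)]
```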


†The typical OFDM symbol duration for 802.11 a/g radios is 4µs,
so the sensing time is 0.2ms. In GNU radios, the symbol duration is
2ms and the sensing time is 100ms.

Choosing Frequency Blocks. Each Jello pair first tries
to find a contiguous frequency block using the best-fit
algorithm. If no such block is available, the pair selects
multiple frequency blocks following the “noncontiguous
best-fit” strategy: select the largest available blocks
until the remaining demand is less than the largest
remaining available block; then use best-fit to choose the final
block. This approach minimizes the number of blocks
required for the session.
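
A minimal sketch of this selection rule follows, again using hypothetical (start, width) tuples for free blocks. Treating "less than" as "no larger than" and returning a partial allocation when the total free spectrum is insufficient are assumptions of this example.

```python
def noncontiguous_best_fit(free_blocks, demand):
    """Choose blocks for `demand` units, preferring a single best-fit block."""
    fits = [b for b in free_blocks if b[1] >= demand]
    if fits:
        return [min(fits, key=lambda b: b[1])]   # contiguous best fit

    remaining = sorted(free_blocks, key=lambda b: b[1], reverse=True)
    chosen, need = [], demand
    while remaining and need > 0:
        if remaining[0][1] >= need:
            # The rest fits in one block: finish with best fit.
            last = min((b for b in remaining if b[1] >= need),
                       key=lambda b: b[1])
            chosen.append(last)
            need = 0
        else:
            largest = remaining.pop(0)           # take the largest block
            chosen.append(largest)
            need -= largest[1]
    return chosen


free = [(0, 10), (20, 30), (60, 25), (100, 12)]
print(noncontiguous_best_fit(free, 20))  # [(60, 25)]
print(noncontiguous_best_fit(free, 40))  # [(20, 30), (0, 10)]
print(noncontiguous_best_fit(free, 50))  # [(20, 30), (60, 25)]
```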


Avoiding Conflicts. When an opportunity to defragment
spectrum appears, multiple device pairs could react
simultaneously, thus leading to frequency adjustments
that conflict. To minimize these conflicts, we add
a random delay to both the sender’s sensing function and
the receiver’s defragmentation triggering. First, upon
detecting a defragmentation opportunity, the receiver waits for
a random interval T^R_backoff, and notifies the sender only
if the opportunity still exists. Second, a sender always
repeats its spectrum measurement after a random delay
of T^S_backoff. A frequency block is considered free only if
it is found to be free during both measurements.
Random backoffs reduce the probability of simultaneous
defragmentation attempts, similar to the CSMA backoffs in
802.11. Finally, devices can configure their backoff
windows based on the projected effectiveness of their
frequency shifts, giving priority to those that can provide
the maximum benefit to the system. For simplicity, Jello
uses a uniform random backoff window.
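
The sender-side double measurement can be sketched as below. The sensing and waiting functions are injected so the example runs without radio hardware; their names, the helper `replay`, and the example block are hypothetical, and the [0.1s, 1s] window is the range reported for the prototype.

```python
import random


def confirmed_free(sense, block, rng, wait, lo=0.1, hi=1.0):
    """Treat `block` as free only if two measurements, separated by a
    uniform random backoff in [lo, hi] seconds, both find it free."""
    if not sense(block):
        return False
    wait(rng.uniform(lo, hi))   # random backoff before sensing again
    return sense(block)


def replay(readings):
    """Hypothetical sensor stub that replays a fixed list of readings."""
    it = iter(readings)
    return lambda block: next(it)


waits = []
rng = random.Random(0)
print(confirmed_free(replay([True, True]), (2, 3), rng, waits.append))   # True
print(confirmed_free(replay([True, False]), (2, 3), rng, waits.append))  # False
```

In the second call another pair has claimed the block during the backoff, so the sender does not move.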


Recovering from Failures. Despite minimizing link 
failures through careful coordination of spectrum sensing 
and selection, link failures are sometimes unavoidable. 
They can occur from external interference or an unlikely 
conflict scenario where two links simultaneously move to 
the same frequency block. Redundancy techniques such 
as error correction codes [24] can improve the robustness 
of coordination packets, but are ineffective under com- 
plete link failures. If a link fails due to interference or 
conflict, its sender-receiver coordination messages will 
also fail to reach their destinations. 

To address this, Jello introduces a SYNC state that
devices enter at initialization or when they detect a
coordination failure. A sender enters the SYNC state after
failing to receive any ACK after retransmitting a REQ
N_s times, and a receiver enters the SYNC state after
not receiving any packets for a time period T_sync. In
the SYNC state, devices communicate on the “SYNC
Frequency Set” (SCS), a set of frequency blocks
dedicated for performing resynchronization. The sender and
receiver perform normal handshakes to reestablish
synchronization and move to selected frequency block(s).
There are several ways to define the SCS; in our current
implementation we configure it as a preassigned frequency
block known to all devices. Devices try to avoid using
the SCS for data transmissions except as a last resort,
maximizing the probability that the SCS is idle.
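
The two triggers for entering the SYNC state can be sketched as follows. The sketch assumes the sender counts unanswered REQs and the receiver is driven by a caller-supplied clock; the class and method names are hypothetical, and the defaults (5 attempts, 3s) are the values used in the prototype.

```python
class SenderLink:
    """Sender enters SYNC after n_s REQs in a row go unanswered."""

    def __init__(self, n_s=5):
        self.n_s = n_s
        self.unanswered = 0
        self.state = "DATA"

    def on_req_timeout(self):
        self.unanswered += 1
        if self.unanswered >= self.n_s:
            self.state = "SYNC"   # continue the handshake on the SCS

    def on_ack(self):
        self.unanswered = 0
        self.state = "DATA"


class ReceiverLink:
    """Receiver enters SYNC after t_sync seconds without any packet."""

    def __init__(self, t_sync=3.0, now=0.0):
        self.t_sync = t_sync
        self.last_rx = now
        self.state = "DATA"

    def on_packet(self, now):
        self.last_rx = now
        self.state = "DATA"

    def tick(self, now):
        if now - self.last_rx >= self.t_sync:
            self.state = "SYNC"   # listen on the SCS for a new REQ


tx = SenderLink()
for _ in range(5):
    tx.on_req_timeout()
print(tx.state)  # SYNC
```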


GNU Radio Implementation. We implement Jello’s
access layer as a user-level program. Because USRP
radios have a large random processing delay of up to
20ms [12,21], we use relatively large timing
parameters in our experiments: T^R_backoff and T^S_backoff are
uniformly distributed in [0.1s, 1s], T_porr = 1s, N_s = 5, and
T_sync = 3s. Each Jello device tries to defragment
the spectrum once every 10s. We choose the 28
lowest indexed subcarriers (out of 256) as the SCS. Based on
our experience with the experimental platform, we found
these to be reasonable parameter values.


5.3 Unexpected Hardware Artifacts


We also observe two unexpected hardware artifacts that 
may affect Jello’s testbed performance. 


Amplified Impact of Frequency Offsets. The band- 
width limitation of USRP radios magnifies the impact 
of frequency offsets, because they are now larger than 
the subcarrier width. Our 20-day measurements show 
that the frequency offsets can reach 10kHz (≈5
subcarriers) but have relatively small variances (<2
subcarriers). To suppress their impact, we manually correct each
USRP’s central frequency by its measured average, re- 
ducing its frequency offset to < 2 subcarriers. 


Artificial Signals. Due to imperfect RF shielding, a
USRP radio may leak energy to its receiving path,
creating an energy peak of random strength near the central
frequency (shown in Figure 6 as a spike near 2.38GHz).
As a result, a radio could mistake some free
subcarriers as being occupied. In our experiments this artifact
leads to a small amount (<2%) of spectrum sensing
errors. The impact is minor because Jello uses temporally
averaged signals in its sensing, reducing the peak’s edge
strength to below the detection threshold.


6 Evaluation 


We evaluate Jello using both network simulations and 
GNU radio experiments. We use simulations to evaluate 
Jello with various design choices and network configura- 
tions. We also run experiments on an indoor network of 
8 GNU radios in a 12m×7m room (Figure 7), running 4
simultaneous media sessions. We configure each radio’s 
transmit power so that each link maintains 5% or less 
packet loss when there is no interference present, and all 
links interfere with each other. Each GNU radio
experiment lasts 10 minutes and is repeated 5 times.

Figure 7: Our Jello testbed: 8 USRP GNU radios are placed
in a 12m × 7m room with various walls and furniture.

We use both VBR video traces and synthetic On/Off
traffic to generate sessions. We scale the traffic flow as 
necessary to create a desired load normalized by the fre- 
quency bandwidth. Sessions carrying video traffic have 
similar average loads, and sessions carrying On/Off traf- 
fic have different traffic volumes. For video traces, we 
assume a 10s application buffer so that each session’s 
demand changes every 10s. For synthetic traffic, the On 
and Off periods are randomly generated from a uniform 
distribution. Each session determines the amount of fre- 
quency required based on its traffic demand and the av- 
erage data rate achievable on each frequency subcarrier. 
If the current available frequency cannot fulfill the entire 
demand, the session will take what is available. 

We evaluate Jello by comparing four systems: 


• Static: partitioning spectrum equally by the number of
sessions; each session has a dedicated frequency block.

• Jello-C: Jello with contiguous frequency access.

• Jello-NC: Jello with non-contiguous access enabled.

• Optimal: an “oracle” solution with perfectly accurate
sensing that removes fragmentation by assigning
spectrum using knowledge of all future requests.


We collect two performance metrics that measure ap- 
plication performance and spectrum usage efficiency: 


Application disruption rate: the proportion of time 
that a session experiences packet losses higher than a 
maximum threshold X, and thus cannot sustain satis- 
factory media quality. For example, prior work shows 
that streaming video sessions can only tolerate up to 
10% packet loss [26]. We examined Jello using X =
5, 10, 20%, and arrived at similar conclusions. Due to
space limitations, we only show results using X = 10%.


Residual usable spectrum: given a traffic load, the 
amount of spectrum left for a new media session, aver- 
aged over time and normalized by the spectrum capacity. 


6.1 Simulation Results 


We first simulate Jello under general network configura- 
tions. We use these experiments to examine and verify 
Jello’s design concept without any sensing/transmission 
error or coordination overhead. Figure 8 shows that 
Jello-NC, by enabling dynamic non-contiguous fre- 
quency access, significantly outperforms Jello-C and 
Static. There is a small gap between Jello-NC and
Optimal because Jello-NC uses periodic defragmentation,
so a low level of fragmentation still remains, leading
to some loss in spectrum from frequency guard bands.
We also make several key observations: 


Impact of k (the maximum number of frequency blocks each
radio can use). Since hardware complexity scales with
k, it is interesting to understand its impact. In Figure 8


Figure 8: Simulated Jello performance using video traces: (a) 
when allowing each radio to access k = 1..4 frequency blocks; 
(b) when using different frequency selection algorithms. 


                 Normalized average traffic load
              |  0.6     0.7     0.8     0.9     1
P(1 block)    | 98.9%   87.7%   70.3%   58.2%   52.1%
P(2 blocks)   |  1.1%   11.8%   25.1%   31.1%   33%
P(3 blocks)   |  0       0.4%    4.2%    9.1%   12.2%
P(4 blocks)   |  0       0.1%    0.4%    1.5%    2.4%


Table 1: Probability distribution of the number of frequency 
blocks each session uses, using Jello-NC with k = ∞.


we examine the application disruption rate of Jello-NC 
by varying k between 1 and 4. We see that raising k
from 1 to 2 leads to a significant performance leap, but
after that the benefit of raising k becomes marginal. To 
further examine this, we list in Table 1 the probability 
distribution of the number of frequency blocks each ses- 
sion uses when k is unlimited. We see that the need for 
non-contiguous access does increase with the traffic load, 
but each session uses no more than 3 blocks with a 97+% 
probability. We repeated our experiments using different 
traffic models and network sizes, and arrived at a simi-
lar observation. Although inconclusive, this suggests that
adding 1 or 2 bands to a radio’s frequency access capabil- 
ity will significantly boost the overall performance. We 
also prove this trend analytically in a separate study [6]. 
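The "97+%" figure can be checked directly from Table 1 by summing the first three rows for each load. The snippet below is only a worked check of that arithmetic; the variable names are illustrative.

```python
# Rows of Table 1: P(number of blocks) at each normalized traffic load, in percent.
loads = [0.6, 0.7, 0.8, 0.9, 1.0]
table1 = {
    1: [98.9, 87.7, 70.3, 58.2, 52.1],
    2: [1.1, 11.8, 25.1, 31.1, 33.0],
    3: [0.0, 0.4, 4.2, 9.1, 12.2],
    4: [0.0, 0.1, 0.4, 1.5, 2.4],
}

def prob_at_most(max_blocks):
    """Cumulative probability (percent) of using at most max_blocks blocks, per load."""
    return [
        sum(table1[b][i] for b in table1 if b <= max_blocks)
        for i in range(len(loads))
    ]

for load, p in zip(loads, prob_at_most(3)):
    print(f"load {load}: P(<=3 blocks) = {p:.1f}%")
```

The smallest cumulative value occurs at full load, where 52.1% + 33% + 12.2% = 97.3%.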


Impact of the frequency selection algorithm. Fig-
ure 8(b) plots the application disruption rate of Jello-C
with Worst Fit, First Fit, Best Fit, and Jello-NC with Best
Fit. We see that Jello-C with Best Fit outperforms Jello-C
with the other algorithms, but only slightly. In our testbed
experiments, we use Best Fit for Jello.
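For readers unfamiliar with the three policies, the sketch below applies the textbook definitions of First Fit, Best Fit and Worst Fit to a list of free contiguous blocks. It is an illustration under assumptions, not Jello's code: the `(start, width)` representation, the tie-breaking rules, and the function name are all hypothetical, and the non-contiguous case is not covered.

```python
def select_block(free_blocks, demand, policy="best"):
    """Pick one contiguous free block that can hold `demand` subcarriers.

    free_blocks: list of (start, width) tuples describing free spectrum.
    Returns the chosen (start, width) tuple, or None if nothing fits.
    """
    candidates = [b for b in free_blocks if b[1] >= demand]
    if not candidates:
        return None
    if policy == "first":   # lowest-frequency block that fits
        return min(candidates, key=lambda b: b[0])
    if policy == "best":    # tightest fit, leaving the smallest leftover
        return min(candidates, key=lambda b: (b[1] - demand, b[0]))
    if policy == "worst":   # loosest fit, leaving the largest leftover
        return max(candidates, key=lambda b: (b[1] - demand, -b[0]))
    raise ValueError(f"unknown policy: {policy}")

free = [(0, 40), (60, 12), (100, 25)]
print(select_block(free, 10, "first"))  # (0, 40)
print(select_block(free, 10, "best"))   # (60, 12)
print(select_block(free, 10, "worst"))  # (0, 40)
```

Best Fit leaves the large blocks intact for later, larger requests, which is one intuition for why it fragments the spectrum least.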


Impact of network topology. We examine this im-
pact using a network of 50 sessions. By varying the
transmit power we create networks of different conflict 
conditions, represented by the average conflict degree 
D. Higher D means each session conflicts with more 
peers. Table 2 lists the application disruption rate for 
Jello-C, Jello-NC and Optimal. The same conclusion ap- 
plies. One interesting observation is that Jello-NC leads 
to more gains as the conflict level decreases. This is be- 
cause non-contiguous access provides more opportunity 
for spatial reuse where non-conflicting sessions can reuse 
the same frequency blocks. 


Average conflict degree D 
9 a0 ico 7.5 8.5 


Jello-C | 0.025 0.057 0.107 0.155 0.207 
Jello-NC | 0.005 0.019 0.054 0.095 0.147 
Optimal | 0.001 0.005 0.018 0.037 0.065 


Table 2: Application disruption rate with different conflict de- 
grees using a large network of 50 sessions. 


6.2 Testbed Results 


We now evaluate Jello using the GNU Radio testbed. All
the results now include the impact of channel impair-
ments, and those of Jello-C and Jello-NC also include the
impact of coordination protocol overhead and spectrum 
sensing errors. For Jello-NC, we use k = 3 in our hard- 
ware implementation. 


6.2.1 Jello’s Overall Performance 


Media Quality Measurements. Figure 9 summarizes 
the application disruption rates using both video and syn- 
thetic traffic. Due to channel impairments, all disruption 
rates are slightly higher than those of simulations. We 
see that Jello-NC can effectively utilize a large portion of 
the spectrum (up to 75%) while keeping disruption rates 
below 5%. It outperforms Static and Jello-C significantly 
and is within a reasonable distance from Optimal. 

Jello-C also outperforms Static, except in the VBR 
case when the traffic load is lower than 68%. This 
unexpected degradation comes from Jello’s coordina- 
tion overhead, hardware artifacts (discussed in Sec- 
tion 5.3) and sensing errors (recall that Static has no 
such overhead). As the traffic load grows, the gain of 
dynamic spectrum multiplexing overcomes the system 
overhead. For On/Off traffic, Jello-C consistently outper- 
forms Static. This is because traffic burstiness is higher 
than that of the VBR traffic, thus dynamic spectrum ac- 
cess leads to significant gains. 

Figure 9: Testbed results: application disruption rate vs. aver-
age traffic load. Jello-NC consistently outperforms Jello-C and
Static, and is within a small gap from Optimal. 


Spectrum Usage Efficiency. As another measure of 
Jello’s spectrum usage efficiency, we measure the resid- 
ual usable spectrum as a function of the normalized aver- 
age traffic load. Figure 10 shows the results for Jello-C, 
Jello-NC and Optimal. The result of Static is not shown
because the entire spectrum is used by existing sessions. 


Compared to Optimal which completely removes all 
fragments, Jello-NC only sacrifices 10-15% of the to- 
tal spectrum bandwidth. Among those, 3% comes from 
the extra guard bands associated with the non-contiguous 
frequency access (due to infrequent defragmentation), 
and the rest is from sensing errors and the fact that each 
new flow can use at most 3 frequency blocks.


For Jello-C, however, the overhead increases to 20- 
30% of the total spectrum bandwidth. In this case, the 
impact of residual fragmentations is amplified by the 
limitation that each new flow can only use 1 frequency
block. For the same reason, its residual spectrum is in- 
sensitive to variations in traffic loads. An alternative way 
to interpret the results is that, compared to Jello-C, Jello- 
NC offers up to 45% more free spectrum to new sessions. 


6.2.2 Where Does The Gain Come From? 


The improvement of Jello comes from both non- 
contiguous spectrum access and online defragmentation. 
In the following, we evaluate their gains separately by 
comparing the performance of Jello with contiguous and 
non-contiguous frequency access, and by enabling and 
disabling online defragmentation.

Figure 10: Testbed results: comparing Jello-C, Jello-NC and
Optimal in terms of the residual usable spectrum. 


Benefits from Non-contiguous Frequency Access. 
For a fair comparison, we assume both access mech- 
anisms use online defragmentation. Figure 9 already 
shows that allowing non-contiguous access keeps the dis- 
ruption rate below 10%, while contiguous access may 
suffer more than 25% disruptions. Another way to in- 
terpret the result is that, to keep a 10% or less disruption, 
non-contiguous access achieves a 22-32% improvement in
spectrum utilization over contiguous access. 


Benefits of Online Defragmentation. Using On/Off 
traffic, we compare the performance of Jello with and 
without online spectrum defragmentation. From Fig- 
ure 11, we see that online defragmentation reduces spec- 
trum disruptions for both contiguous and non-contiguous 
Jello. For example, with 68% load, defragmentation 
reduces disruptions from 18% to 15% for contiguous 
and 5% to 3% for non-contiguous access. Compared to 
enabling non-contiguous access, online defragmentation 
has a smaller gain. This is because in our implemen- 
tation, Jello devices defragment infrequently (at most 
twice per On period) due to hardware limitations. 


6.2.3 Jello’s Overhead 


Having examined Jello’s application-level and spectrum 
usage performance, we now look into the overhead that 
separates Jello-NC from Optimal. We quantify the im- 
pact of each element that contributes to Jello-NC’s appli- 
cation disruption rate. These include: (1) the inherent 
traffic dynamics where the total spectrum cannot sup- 
port all the sessions (the same applies to Optimal); (2) 
the frequency guard band overhead from non-contiguous 

Figure 11: Testbed results: benefits of Jello’s online defrag-
mentation using On/Off traffic.

Figure 12: Testbed results: breakdown of contributions in
Jello-NC’s disruptions, using video traces. The impact of traffic
dynamics is unavoidable and also applies to Optimal.


frequency access (for being unable to defragment spec- 
trum completely); (3) the sensing error and coordination 
overhead caused by channel impairments, and (4) con- 
flicting defragmentation. Results in Figure 12 show that 
the guard band overhead has a relatively small impact 
compared to the other two, which confirms that Jello pro-
duces a very low level of spectrum fragmentation. The
probability of defragmentation conflicts is 0.5% in our 
experiments and its impact is absorbed in the coordina- 
tion errors in Figure 12. 


Frequency Guard Bands. In Figure 13(a), we com- 
pare Jello-NC and Optimal in terms of their guard band 
overhead. Without any fragment, Optimal uses a fixed 
number of guard bands (6 subcarriers out of 240 usable 
subcarriers), or a 2.5% overhead. For Jello-NC, the
guard band overhead increases with the traffic load be- 
cause each session uses more frequency blocks to fulfill 
its demand. However, similar to the results in Table 1, 
in our experiments more than 85% of time a session uses 
only 1 or 2 frequency blocks. Thus the overall guard
band overhead is less than 5% even at 80% traffic load. 


Spectrum Sensing Errors. Figure 12 shows that sens- 
ing errors could be a major contributor to the disruptions. 
In our current implementation, the average false positive
(treating available blocks as occupied) and false negative
(treating occupied blocks as available) rates are 5-10%.
Figure 13(b) shows the results of two sample topologies. 
These errors are due to the time-varying channel im- 
pairments and heterogeneous signal strengths commonly 
found in indoor environments. 

On the other hand, Jello’s edge-detection based sens- 
ing is much more accurate than the energy-detector, and 
is relatively insensitive to the choice of detection thresh- 
old. To quantify this benefit, we plot in Figure 14 the 
detection false positive and false negative rates from 
energy-detection based sensing, using the same topolo- 
gies in Figure 13(b). We see that energy-detection sens- 
ing leads to much higher detection errors, and is highly 
sensitive to the choice of its detection threshold (-32dB 
for topology 1, -48dB for topology 2). In addition, it 
suffers from high false positives (e.g. 40%) in order to 
maintain a reasonable rate of false negatives (e.g. 10%). 
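The threshold sensitivity of an energy detector can be illustrated with a small sketch that computes both error rates for a given threshold, using the definitions above (false positive: an available block treated as occupied; false negative: an occupied block treated as available). The function name and the sample readings are made up for illustration and are not measurements from the testbed.

```python
def detection_error_rates(energies_db, occupied, threshold_db):
    """False positive / false negative rates of a threshold energy detector.

    energies_db: measured energy per frequency block, in dB.
    occupied: ground-truth booleans, True if the block is really in use.
    A block is declared occupied when its energy is at or above threshold_db.
    """
    fp = fn = free_total = busy_total = 0
    for energy, busy in zip(energies_db, occupied):
        declared_busy = energy >= threshold_db
        if busy:
            busy_total += 1
            fn += not declared_busy   # occupied block treated as available
        else:
            free_total += 1
            fp += declared_busy       # available block treated as occupied
    fp_rate = fp / free_total if free_total else 0.0
    fn_rate = fn / busy_total if busy_total else 0.0
    return fp_rate, fn_rate

# Hypothetical readings: a weak occupied block (-45 dB) and a noisy free block (-35 dB).
energies = [-30, -45, -50, -35, -55, -40]
truth = [True, True, False, False, False, True]
for threshold in (-60, -42, -20):
    print(threshold, detection_error_rates(energies, truth, threshold))
```

Lowering the threshold drives false negatives to zero at the cost of flagging every free block as occupied, and raising it does the opposite, which is the tradeoff the text describes.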


Coordination Overhead. The majority of Jello’s co- 
ordination overhead is due to links falling back to the 
SYNC state to resynchronize. In our experiments, these 
occur from external interference, or an unlikely conflict 
in frequency adjustments. From Figure 13(c), we see that 
the probability of entering SYNC is only 2-3%, and the 
average recovery time is 4-5s. Both the SYNC probabil- 
ity and the recovery time increase with the traffic load 
because as more sessions start to adapt frequency for 
additional spectrum, they create slightly more conflicts 
and more traffic on the SCS. Now a session could wait 
longer before starting resynchronization. However, be- 
cause links leave the SCS immediately after locating free 
spectrum, the SCS utilization stays low. 


7 Discussion 


We can extend Jello in the following directions. 


Integrating with Other MAC Functions. Due to the
USRP radios’ large processing delay, the current Jello im-
plementation does not include several MAC functions.
These include (1) rate adaptation (Jello uses BPSK); (2) 
channel-aware frequency selection (in our experiments 
the channel quality is flat across the frequency range due 
to limited bandwidth); (3) power control (we use uniform 
transmit power across all the subcarriers in use); and (4) 
packet retransmission. Using more powerful radio platforms,
Jello can add these functions. A key issue is to inves-
tigate the interaction between Jello’s frequency selection
and these functions and to jointly optimize them.


Optimizing Frequency Selection. Jello’s frequency
selection algorithms focus on minimizing network-wide
spectrum fragmentation and conflicts. Additional infor-
mation about each spectrum section such as received sig-
nal strength can allow Jello to choose a good set of fre-
quency blocks that achieves reliable transmissions match-
ing its traffic demand while minimizing frequency usage.
Jello can also use this information to configure a proper
amount of guard bands at link boundaries instead of us-
ing a uniform configuration. An interesting issue is how
to obtain such information reliably and efficiently.


Figure 13: Testbed results: examining Jello-NC’s overhead in terms of the frequency guard band overhead, sensing errors, and
coordination delay.


Figure 14: Testbed results: detection reliability of energy
detection-based sensing as a function of its detection thresh-
old. Compared to Jello’s edge-detector, it leads to much higher
detection errors, and is highly sensitive to the choice of its de-
tection threshold (-32dB for topology 1, -48dB for topology 2).
[Plot axes: probability vs. energy detector threshold (dB).]


Porting Jello to Other Radios. Jello can be ported 
onto advanced hardware platforms [16, 21, 24, 28, 30], 
to benefit from their increased frequency bandwidth and 
processing speed. For best performance, Jello requires 
radios that can support fine-grained frequency access, 
and quickly scan spectrum to identify available ranges. 


8 Related Work


We divide the related work into two categories: contigu- 
ous and non-contiguous frequency access. 


Contiguous Frequency Access. The majority of work
on dynamic spectrum networks assumes contiguous fre-
quency access [1, 4, 8, 16, 20, 31, 32]. This access pat-
tern has the advantage of being readily implemented on
conventional 802.11 devices [8]. In this context, prior 
works have developed centralized algorithms for load 
balancing [20], distributed protocols for spectrum con-
tention [32] and for utilizing UHF whitespaces [4].


Jello differs from these works in three aspects. First, 
Jello’s per-session FDMA design is more general in that 
it operates across wider spectrum ranges at a fine gran- 
ularity and completely eliminates CSMA traffic con- 
tention. Second, unlike [32], which requires a sepa- 
rate control radio to reserve spectrum, Jello devices self- 
sense spectrum to avoid access conflicts, and defrag- 
ment spectrum while staying transparent to others. As 
a result, Jello provides dedicated frequency usage to de- 
manding applications. Finally, Jello’s spectrum sensing 
differs from SIFT [4] which detects any contiguous fre- 
quency usage using time-domain signals. Instead, Jello 
uses wide-band sensing in the frequency domain that can 
quickly identify multiple active frequency blocks instead 
of single blocks at a time. 


Non-contiguous Frequency Access. Most works in 
this area assume either centralized control [23] or a ded- 
icated radio for control. Others are limited to simula- 
tions [7] without considering practical artifacts such as 
sensing and guard bands. Jello, on the other hand, imple- 
ments distributed non-contiguous frequency access and 
deploys a USRP prototype. 


SWIFT [24] is a distributed wideband spectrum access 
system that can use a large frequency band even when 
a narrowband signal is present. SWIFT nodes share 
spectrum in the time domain using CSMA. Jello differs 
from SWIFT by using per-session FDMA to avoid costly 
packet contentions and by using a non-intrusive edge-
detection based mechanism to identify usable frequency.
ODS [15] implements on-demand spectrum access using 
spread-spectrum codes, focusing on adapting spectrum 
allocation to bursty traffic. It applies a random policy for 
selecting codes and uses adaptive receiver feedback to 
regulate code allocations. Jello differs from ODS by op- 
erating in the frequency-domain, using spectrum sensing 
to avoid access conflicts. 


USENIX Association 




9 Conclusion 


Jello provides a new distributed spectrum access tech- 
nique for demanding wireless applications. High-quality 
delay-sensitive media sessions can now access and share 
the wireless medium in the frequency domain and adapt their
spectrum usage to varying traffic demands. Jello uti- 
lizes frequency-agile radios to sense, identify and occupy 
unused spectrum, allowing multiple sessions to work in 
parallel on isolated frequencies. To maximize spectrum 
usage efficiency, Jello devices self-defragment spectrum 
on-the-fly, and scavenge multiple frequency fragments 
for use by single, high-speed transmissions. Jello is also 
MAC-agnostic and does not require any dedicated radio 
for control. Despite USRP radio’s limited bandwidth and 
large processing delays, our measurements on an 8-node 
testbed confirm that Jello can provide reliable spectrum 
access for media applications and significantly improve 
spectrum usage efficiency. 

Contracts: Practical Contribution Incentives for P2P Live Streaming


Michael Piatek, Arvind Krishnamurthy, Arun Venkataramani,
Richard Yang, David Zhang, Alexander Jaffe
Abstract


PPLive is a popular P2P video system used daily by mil-
lions of people worldwide. Achieving this level of scala-
bility depends on users making contributions to the sys-
tem, but currently, these contributions are neither verified
nor rewarded. In this paper, we describe the design and
implementation of Contracts, a new, practical approach
to providing contribution incentives in P2P live stream-
ing systems. Using measurements of tens of thousands
of PPLive users, we show that widely-used bilateral in-
centive strategies cannot be effectively applied to the live
streaming environment. Contracts adopts a different ap-
proach: rewarding globally effective contribution with
improved robustness. Using a modified PPLive client,
we show that Contracts both improves performance and
strengthens contribution incentives. For example, in our
experiments, the fraction of PPLive clients using Con-
tracts experiencing loss-free playback is more than 4
times that of native PPLive.


1 Introduction


System collapse due to large-scale reductions in user
contributions is a major concern for PPLive, which is one
of the most widely deployed live streaming services on
the Internet today, serving more than 20 million active
users spread across the globe. Using peer-to-peer (P2P)
as the core technique, PPLive achieves cost-effective live
video distribution by providing a small amount of seed
bandwidth to a few participants, with the rest of the dis-
tribution being performed by users relaying data. Thus,
the availability and scalability of PPLive depends cru-
cially on the contributions of its users.

The current PPLive design neither verifies nor rewards
contributions, creating the potential for strategic users to
restrict their contribution, degrading robustness. This is
particularly true in environments where capacity is lim-
ited or priced by usage. Furthermore, when developing
an open live video streaming standard, relying on closed
systems with proprietary protocols is not feasible.

In this paper, we explore how to provide practical
contribution incentives for P2P live streaming, using
PPLive as a concrete example. Although incentives have
been studied extensively in the case of widely deployed
file-sharing systems (e.g., [1, 16, 22]), live streaming
presents unique challenges. For instance, clients cannot
be rewarded with faster downloads once they are receiv-
ing data at the broadcast rate (since additional data has
not yet been produced). While some recent proposals
have considered contribution incentives in a live stream-
ing setting (e.g., [17]), they do not take into account sev-
eral practical considerations of deployed systems, such
as client heterogeneity and operation under bandwidth
constraints. We provide an examination of live stream-
ing incentives grounded in experience with a deployed
and widely used live streaming system.

We proceed in two steps. First, we use measurements 
of tens of thousands of PPLive clients to demonstrate 
quantitatively the challenges in adapting existing incen- 
tive strategies to the live streaming environment. We find 
that in practice, the majority of system capacity is con- 
tributed by a minority of high capacity users. As a result, 
incentive mechanisms that require balance between con- 
sumption and contribution will either exclude many users 
from participation or underutilize capacity substantially. 

More broadly, bilateral exchange mechanisms widely 
used in bulk data distribution, such as tit-for-tat in Bit- 
Torrent [5], are ineffective in the live streaming envi- 
ronment, in part because sequential block availability 
sharply limits trading opportunities between peers. Im- 
posing topologies that increase bilateral trading opportu- 
nities (e.g., [17]) increases the variance in block delivery 
delay, causing either increased playback deadline misses 
or increased startup delay. Tit-for-tat, for example, sub- 
stantially reduces performance when applied to PPLive. 

The second part of the paper describes Contracts, a 
new design for providing robust contribution incentives 
in live streaming P2P services. Contracts differs from 
existing techniques in two principal ways. First, Con- 
tracts is designed for the live streaming environment. 
Rather than relying on increased download rates, Con- 
tracts rewards contributions with increased quality of 
service when the system is constrained. Second, Con- 
tracts departs from traditional incentive mechanisms that 
rigidly constrain client behavior. Instead, we define a 
default contract specifying an agreement between an in-
dividual client and the overall system (i.e., PPLive and
other clients) as to how its contributions will be evaluated 
by others. To enable this, we introduce a lightweight pro- 
tocol that provides verifiable accounting of each client’s 
contributions. However, the contract does not mandate fine-
grained behavior, leaving individual clients free to make
local optimizations that increase efficiency. 

We have integrated Contracts with PPLive, and find 
that our implementation both improves performance and 
strengthens contribution incentives. For example, in our 
experiments, the fraction of PPLive/Contracts clients ex-
periencing loss-free playback is more than 4 times that of
native PPLive, and clients that contribute more than oth- 
ers receive consistently higher quality of service. 

The remainder of this paper is organized as follows. 
Section 2 provides an overview of live streaming in 
PPLive. Sections 3 and 4 describe the challenges of ap- 
plying incentive strategies based on bilateral exchange to 
live streaming. These challenges motivate the design and 
implementation of Contracts, which we present in Sec- 
tion 5 and evaluate in Section 6. We discuss related work 
in Section 7 and conclude in Section 8. 


2 PPLive overview


PPLive is a hybrid P2P system for streaming live and 
on-demand video. Clients are organized into channels, 
with members of a given channel redistributing video 
data to one another. Clients rely on two forms of infras- 
tructural support: 1) coordinating trackers that provide 
a rendezvous point for users watching the same channel, 
and 2) seed bandwidth provided by a group of broadcast 
servers that source all content. Multiple channels can 
be managed by a single tracker and sourced by a single 
broadcaster. Currently, PPLive maintains roughly 600 
public live channels daily. 

The wire-level details of the PPLive protocol are sim- 
ilar to existing swarming systems like BitTorrent [5]. 
Clients maintain a large set of directly connected peers 
(50-100) to which they advertise their local data blocks 
and issue requests for missing blocks. Each block is 4-16 KB
and is discarded shortly after being played.

Trackers maintain state that includes the set of clients 
in each channel, the properties of clients (e.g., reported 
bandwidth capacity and NAT status), and the overall 
health of channels. The health of an individual chan- 
nel is monitored by its broadcast source. Clients use the 
source as a peer of last resort. Only when a block cannot 
be obtained from any other peer is a request sent to the 
data source. Thus, the load on the broadcaster provides 
a metric for the health of a given channel relative to oth- 
ers. By shifting capacity from channels demanding less 
load to those servicing more requests, PPLive allocates 
infrastructure bandwidth automatically. 

Most relevant to our work is PPLive’s servicing policy. 
By default, each client contributes its full available ca- 
pacity and does not prioritize service for particular peers, 
i.e., download requests from a peer contributing little to
the system and those from a peer making the highest con- 
tribution are treated equally. 


3 Limits of bilateral exchange 


The lack of contribution incentives means that PPLive’s 
scalability depends on clients’ good will and the faithful 
execution of its software. At present, these are largely 
effective due to flat-rate network pricing and the com-
plexity of PPLive’s proprietary implementation. Increas-
ingly, however, the all-you-can-eat pricing model is giv- 
ing way to the realities of network management [6]. Fur- 
thermore, there is continued interest in developing an 
open live video streaming standard supporting multiple 
implementations (e.g., IETF 73 PPSP BoF). These trends
motivate the explicit consideration of incentives for live 
streaming systems to reward good (and discourage bad) 
behavior by coupling performance and contribution. 

Why does live streaming necessitate revisiting the in- 
centive design problem? To appreciate this, consider the 
most widely-used class of incentive strategies in P2P sys- 
tems today—bilateral exchange—wherein a peer x deter- 
mines the amount of upload bandwidth to peer y based 
solely on the amount of bandwidth that client y uploads 
to x, independent of the total bandwidth that client y up- 
loads to all clients. Servicing policies based on bilateral 
exchange are compellingly simple. For example, tit-for- 
tat has been widely applied in bulk data distribution sys- 
tems (e.g., BitTorrent [5]), and has also been studied ex- 
tensively [16, 25]. More recently, bilateral exchange has 
also been proposed as a basis for providing incentives in 
live streaming systems (e.g., [17]). However, bilateral 
exchange schemes suffer from fundamental performance 
limitations in the context of live streaming. 
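A bilateral servicing policy of this kind can be sketched in a few lines. The proportional split below is an assumption chosen for simplicity; it is not BitTorrent's actual tit-for-tat, which works by choking and unchoking peers, and the function name is hypothetical. The point it illustrates is only that each neighbor's share depends on what that neighbor sent to us and on nothing else.

```python
def bilateral_allocation(upload_capacity, received_from):
    """Split a peer's upload capacity across neighbors, bilateral-exchange style.

    received_from: dict mapping neighbor id -> rate that neighbor uploaded to us.
    Each neighbor's share depends only on what it sent to us, not on its
    total contribution to the rest of the system.
    """
    total = sum(received_from.values())
    if total == 0:
        # Nobody has given us anything yet, so nobody earns a share.
        return {peer: 0.0 for peer in received_from}
    return {
        peer: upload_capacity * rate / total
        for peer, rate in received_from.items()
    }

print(bilateral_allocation(100, {"y": 30, "z": 10}))
```

Here peer y, which supplied three quarters of what we received, gets three quarters of our upload capacity, regardless of how much y or z uploads to other peers.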

For bilateral exchange to work, peers need to have 
trading opportunities (i.e., distinct data blocks of mu-
tual interest). When distributing bulk data, trading op-
portunities are frequent. Each client seeks to acquire the 
entirety of a large set of blocks. Bulk distribution sys- 
tems typically use block selection strategies such as lo- 
cal rarest first (e.g., BitTorrent [5]) or network coding of 
blocks (e.g., Avalanche [9]) to ensure that all blocks have
roughly equal trading value over time. Ideally, once
a new client has received just a few random blocks, it is 
bootstrapped into the trading system. 

Live streaming differs radically from bulk data distri- 
bution in ways that significantly reduce the effectiveness 
of bilateral exchange. We consider four key challenges 
in live streaming that inform the design of Contracts. 

1) Heterogeneity: Capacity heterogeneity poses a fun- 
damental challenge to the efficiency of balanced ex- 
change schemes. Live streaming offers a common down- 
load rate—the stream playback rate—to all peers regard- 
less of their upload capacity. In practice however, peer 
capacities can vary by an order of magnitude. We ver-
ify this by measuring the capacity distribution of PPLive
clients using logs of the reported bandwidth capacity of 
99,184 clients. The distribution is highly skewed with a 
mean capacity (142 KBps) that is more than double the 
median (65 KBps). As a result of the skew, the majority 
of total aggregate capacity is provided by a minority of 
high capacity peers. The top 10% of clients account for 
58% of total capacity.

Capacity heterogeneity implies a discouraging trade- 
off between efficiency and balance. Insisting on near- 
perfect balance will either exclude many users that can- 
not support the stream rate or significantly underutilize 
capacity. Concretely, streaming at the average capac-
ity of 142 KBps (the maximum possible) would
exclude 86% of PPLive clients in our trace when requir- 
ing balanced contribution and consumption. On the other 
hand, providing service to 95% of PPLive clients (with 
balanced exchanges) requires restricting the stream data 
rate to the 5th percentile of capacity at 21 KBps, which 
corresponds to an overall utilization of just 15%. 
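The 15% figure follows from the two capacities quoted above. The snippet is a worked check under the assumption that, with balanced exchange, utilization equals the stream rate divided by the mean capacity; the variable names are illustrative.

```python
mean_capacity_kbps = 142.0   # mean reported upload capacity (from the trace)
stream_rate_kbps = 21.0      # 5th-percentile capacity, admits 95% of clients

# With balanced exchange every admitted peer uploads only the stream rate,
# so utilization is the stream rate divided by the mean capacity.
utilization = stream_rate_kbps / mean_capacity_kbps
print(f"utilization = {utilization:.1%}")
```

The ratio 21/142 is about 0.148, which rounds to the 15% stated in the text.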

The fundamental tradeoff between efficiency and bal-
ance under skew can be quantified as follows. Let $\mu$ and
$\sigma$ respectively denote the mean and variance of the up-
load capacity distribution. We define the skew¹ $S$ as $\sigma/\mu$.
For a stream rate of $r$, the efficiency $E$ is $r/\mu$, where a fea-
sible broadcast implies $\mu \geq r$. We define the imbalance $I$
as the deviation of peer upload rates with respect to the
stream rate normalized by the mean, i.e.,
$I = \frac{1}{\mu} \sqrt{\frac{\sum_{i} (x_i - r)^2}{N}}$,
where $x_i$ is peer $i$'s upload rate (less than or equal to its
capacity) and the sum indexes over all $N$ peers. Note that
all three of skew, efficiency, and imbalance lie between 0
and 1. The theorem below captures the stated tradeoff,
the proof of which is available in a technical report [24].


THEOREM 1 High efficiency and high skew imply high
imbalance. Specifically, (a) If peers upload at a rate pro-
portional to their capacity, $I = E \cdot S$. (b) For any feasible
set of upload rates, $I$ is bounded from below by a function
that monotonically increases from 0 to $S$ as $E$ increases
from 0 to 1.
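Part (a) can be checked numerically. The sketch below is an illustration under stated assumptions: the capacities are made-up values (chosen so that their mean is 142), the dispersion $\sigma$ is taken as the population standard deviation so that $S$ is dimensionless, and the imbalance is computed as the root-mean-square deviation of upload rates from $r$ divided by $\mu$. None of the function names come from the paper.

```python
import math

def mean(values):
    return sum(values) / len(values)

def skew(capacities):
    """S = sigma / mu, with sigma assumed to be the population standard deviation."""
    mu = mean(capacities)
    sigma = math.sqrt(sum((c - mu) ** 2 for c in capacities) / len(capacities))
    return sigma / mu

def imbalance(uploads, stream_rate, mu):
    """I = RMS deviation of upload rates from the stream rate, normalized by mu."""
    rms = math.sqrt(sum((x - stream_rate) ** 2 for x in uploads) / len(uploads))
    return rms / mu

capacities = [20, 40, 65, 100, 485]   # hypothetical capacities in KBps
mu = mean(capacities)                 # 142.0
r = 71.0                              # stream rate, so E = r / mu = 0.5
E = r / mu

# Case (a): every peer uploads in proportion to its capacity.
uploads = [c * r / mu for c in capacities]
I = imbalance(uploads, r, mu)
S = skew(capacities)
print(f"E = {E:.2f}, S = {S:.3f}, I = {I:.3f}, E*S = {E * S:.3f}")
assert math.isclose(I, E * S)
```

The identity holds because scaling every capacity by $r/\mu$ scales the deviations from the mean by the same factor, so the normalized deviation is exactly $E$ times the skew.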


2) Limited bandwidth needs: In bulk data distribu- 
tion using bilateral exchange, the incentive to increase 
upload rate is a corresponding increase in download rate. 
In live streaming, however, once a client is downloading 
data at the rate of production, a further increase in down- 
load rate is not possible (as additional data does not yet 
exist). Although one may consider rewarding increased 
contribution with improved video quality (e.g., at higher 
resolutions using layered coding), PPLive avoids such a 
scheme due to its increased complexity, reduced video 
coding efficiency, and the need for substantially higher 
bandwidth to produce visually-discernible quality differ-
ences. Thus, a key challenge in incentivizing users to con-
tribute capacity in excess of their demands is creating
a compelling reward with nonzero marginal utility.

The above points do not rule out bilateral exchange 
schemes that are not balanced, which we consider below. 

3) Limited trading opportunities: Bilateral exchange
depends on the existence of mutually beneficial trad- 
ing opportunities to evaluate peers. Unfortunately, live 


1 unlike the more standard definition based on the third moment. 
Figure 1: The impact of distance from the broadcast
source on bilateral exchange. Requiring balanced ex- 
change significantly limits trading opportunities as does 
distance from the source. 


streaming provides clients with limited opportunities for 
mutually beneficial trading. The key difference is that 
unlike bulk data distribution, where blocks have roughly 
equal value over time and among clients, the value of 
blocks in live streaming varies over time and client. A 
block has little value at a client if it is received after the 
playback point at the client. Thus, the data useful to an 
individual client is limited to a narrow range between 
the production point and the local playback point, i.e.,
the lag. The smaller the lag that a system targets, the 
fewer the trading opportunities. Furthermore, blocks in 
live streaming emerge at the data source one at a time 
at the production rate unlike bulk data distribution where 
data becomes available all at once. As a result, clients 
closer to the data source in the topology have inherent 
advantages in receiving rare (new) blocks first, creating 
a perpetual trade imbalance with clients further from the 
source. Although trade imbalance does not necessarily 
rule out bilateral exchange schemes, it makes evaluating 
peers significantly more challenging. 

To make this concrete, we compare the number of 
trading opportunities for clients in a PPLive broadcast 
with 100 clients running on the Emulab testbed. Each 
client uses random block selection to maximize trading 
opportunities unless a block is near its playback deadline. 
Each client simultaneously joins a test stream and periodically
logs its buffer state during playback. Figure 1 summarizes
the trading opportunities among pairs of peers
taken from a snapshot of buffer states collected several 
minutes into the broadcast. Each individual client’s aver- 
age distance to the broadcast source is the average num- 
ber of overlay hops traversed by all of its received blocks, 
which correlates with lag. The number of trading oppor- 
tunities is shown in terms of the absolute difference in 
these averages for pairs of peers (x-axis) without con- 
straints and when a balanced number of mutually ben- 
eficial trades is required. These results show that the 
greatest opportunity for bilateral exchange is between 
peers that are at a similar distance from the broadcast 
source. But, such pairs of peers are in the minority. Most 
Figure 2: Illustration for Theorem 2: Node j has higher
upload capacity than node i but has fewer descendants.


block transfer opportunities are between pairs of peers 
that have an imbalance of useful data. 
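
To make the counting concrete, the sketch below computes trading opportunities for one pair of peers from their buffer snapshots. It is an illustration under stated assumptions, not the measurement code behind Figure 1: buffers are modeled as sets of block identifiers, an opportunity is a block that one peer holds and the other lacks, and the balanced count assumes trades are matched one for one. The function name `trading_opportunities` is hypothetical.

```python
def trading_opportunities(buf_a, buf_b):
    """Return (unconstrained, balanced) trade counts for two buffer snapshots.

    buf_a, buf_b: sets of block ids held by each peer (assumed model).
    """
    a_to_b = len(buf_a - buf_b)  # blocks A could send to B
    b_to_a = len(buf_b - buf_a)  # blocks B could send to A
    unconstrained = a_to_b + b_to_a
    # Balanced exchange: every block sent must be matched by one received.
    balanced = 2 * min(a_to_b, b_to_a)
    return unconstrained, balanced


# A peer near the source holds newer blocks that a distant peer lacks,
# while the distant peer has nothing new to offer in return.
near = {5, 6, 7, 8, 9, 10}
far = {5, 6, 7}
print(trading_opportunities(near, far))
```

In this toy snapshot every useful transfer flows in one direction, so requiring balance leaves no trades at all, which mirrors the imbalance described above.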

4) Delay sensitivity: Live streaming requires low distribution
delay so as to reduce lag (i.e., improve liveness)
or, equivalently, reduce the playback miss rate for
a given lag. Minimizing block dissemination delays im- 
poses some structural requirements on the steady-state 
block distribution topology. We formalize this claim as 
follows. The dissemination topology traversed by a sin- 
gle block must be a tree as peers only request blocks they 
do not already have. Consider the dissemination tree T 
for a single block. Note that different blocks may have 
different dissemination trees, so a node may be at differ- 
ent distances from the source across blocks. We assume 
that in steady-state, the system can sustain the stream rate 
such that a block is never queued at a node behind
another block.²


THEOREM 2 Any topology in which a peer i has lower
bandwidth than peer j but i has more descendants than
j has higher average block delay than the topology obtained
by swapping i and j if one of the following two
conditions holds: (a) the topology is a balanced tree, or
(b) i is an ancestor of j.


Figure 2 illustrates the condition in the theorem above. 
The proof shows that, if either T is balanced or i is an
ancestor of j, T can be transformed to a topology T’ with
lower delay by simply swapping i and j.
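
A small numeric illustration of case (b) follows. It uses a deliberately simple delay model that is our assumption and not the model used in the proof [24]: a node with capacity c (in multiples of the stream rate) forwards a block to its children one after another, so its k-th child receives the block k/c time units after the node itself. The tree shape, node names, and capacities are hypothetical.

```python
def average_delay(tree, capacity, root, root_time=0.0):
    """Average block arrival time over all non-root nodes.

    tree: dict mapping node -> ordered list of children.
    capacity: dict mapping node -> upload capacity (stream-rate multiples).
    Assumed model: the k-th child of a node receives the block k / capacity
    time units after the node does.
    """
    arrival = {root: root_time}
    stack = [root]
    while stack:
        node = stack.pop()
        for k, child in enumerate(tree.get(node, []), start=1):
            arrival[child] = arrival[node] + k / capacity[node]
            stack.append(child)
    delays = [t for n, t in arrival.items() if n != root]
    return sum(delays) / len(delays)


capacity = {"src": 1.0, "i": 1.0, "j": 3.0, "a": 1.0}

# Before: low-capacity i is an ancestor of high-capacity j.
before = {"src": ["i"], "i": ["j", "a"]}
# After: i and j swap positions in the tree.
after = {"src": ["j"], "j": ["i", "a"]}

print(average_delay(before, capacity, "src"))
print(average_delay(after, capacity, "src"))
```

Under this model the swapped tree has the lower average delay, because the high capacity node serves both of its children before the low capacity node could finish serving one.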

The structural requirement for low delays presents a 
design conflict for bilateral exchange schemes. Being 
closer to the source, a high capacity peer A is likely to 
receive newer blocks before a lower capacity child B, so 
A is unlikely to benefit from B. However, in bilateral ex- 
change, A evaluates B solely by B’s uploads to A, forcing 
B to try to upload to A even though that is detrimental to 
the average block delay. Note that bulk distribution does 
not face this predicament as it does not require individ- 
ual block delays to be low, a crucial consideration in live 
streaming. 


2 The technical report [24] describes a procedure to construct such a
dissemination tree packing. 
Figure 3: Cumulative fraction of clients with a given
block delivery rate for different topologies. Placing high 
capacity clients near the source improves quality. 


4 Structuring for performance 
and incentives 


The limitations of bilateral exchange lead us to pursue 
a fundamentally different approach to providing incen- 
tives in P2P live streaming systems. Instead of reward- 
ing higher upload rates with higher download rates, we 
craft a mechanism that incentivizes higher upload rates 
with robust playback quality; i.e., fewer missed playback
deadlines, despite churn and capacity constraints. As ob- 
served in measurement studies [20], in a streaming sys- 
tem at a given channel rate, robust playback quality is 
the key determiner of user satisfaction. Our design of 
the incentive mechanism is enabled by a pleasant coinci- 
dence that aligns performance and incentive objectives: 
high capacity peers must be close to the source to keep 
block delays low, and peers closer to the source experi- 
ence lower and more predictable block delays yielding 
better playback quality. 

Topology: To maximize utilization, high capacity clients 
need to be placed near the data source so that they can 
quickly replicate useful data. To demonstrate the im- 
pact of topology and capacity heterogeneity on play- 
back quality, we compare the block delivery rate for 120 
instrumented PPLive clients running on Emulab under 
three scenarios: 1) clients joining in a random order, 
2) high capacity clients joining first, followed by low ca- 
pacity clients, and 3) low capacity clients preceding high 
capacity. In each scenario, the over-provisioning of ca- 
pacity relative to demand is two. 50% of clients are as- 
signed an upload capacity equal to the stream data rate 
(low capacity) with the remaining 50% having capacity 
3x the stream rate (high capacity). The results are sum- 
marized in Figure 3. Playback quality is best when high 
capacity peers join first, and are therefore closer to the
data source. When high capacity peers join last, the qual- 
ity degrades significantly. With no change in total sys- 
tem capacity, the median delivery rate drops from 0.95 to 
0.75. In practice, the order in which clients join is likely 
to be random with respect to capacity, yielding playback 
quality in between the two extremes. 
Figure 4: The fraction of blocks missing playback dead-
lines as a function of distance from the broadcast source. 
Playback quality is best for clients nearest to the source. 


Buffering: When blocks miss playback deadlines,
PPLive takes one of two actions: 1) if only a few blocks 
are missing, they are skipped; 2) but, if several blocks 
miss their playback deadline, PPLive pauses while wait- 
ing for downloads to complete. This buffering policy is 
designed to handle temporary degradation in quality of 
service. A single missed block has limited impact on 
video quality, and rebuffering suffices to recover from 
more significant fluctuations. But, if a client chronically 
experiences misses, it will eventually fall so far behind 
its directly connected peers that required blocks are no 
longer available. In this case, users need to manually 
rejoin the broadcast. Restarting is a simple recovery 
mechanism and requiring it is an explicit design choice 
in PPLive that is consistent with typical user behavior. 
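
The policy can be summarized as a small decision function. This is a minimal sketch rather than PPLive’s implementation: the text only distinguishes "a few" missed blocks from "several", so `skip_limit` and `rejoin_limit` are hypothetical thresholds, and in PPLive the rejoin step is performed manually by the user.

```python
def playback_action(missed_blocks, blocks_behind_peers,
                    skip_limit=2, rejoin_limit=200):
    """Choose a recovery action for the next playback step.

    skip_limit and rejoin_limit are hypothetical thresholds; the text only
    distinguishes "a few" missed blocks from "several".
    """
    if blocks_behind_peers > rejoin_limit:
        # Required blocks are no longer available from connected peers.
        return "rejoin"
    if missed_blocks == 0:
        return "play"
    if missed_blocks <= skip_limit:
        return "skip"
    return "pause"


for missed, behind in [(0, 0), (1, 0), (5, 0), (1, 500)]:
    print(missed, behind, playback_action(missed, behind))
```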


The buffering policy implies that the effects of quality 
degradation cascade. When a client near the data source 
stalls, more distant clients to which it forwards data
also experience service disruption. Although the PPLive 
mesh contains significant path redundancy, failover is not 
instantaneous and may require rebuffering. We quantify 
the quality of service in terms of distance from the broad- 
cast source in Figure 4. In this experiment, 127 clients 
with equal capacities (twice the stream rate) participate 
in a broadcast with one client joining every 10 seconds. 
Statistics are computed after all clients have been in the 
system for at least 20 minutes. As in previous experi- 
ments, we define the average distance of a client from 
the source to be the average number of hops traversed by 
all blocks received by the client. The results show that 
service quality degrades with distance from the source 
even in a static setting. Introducing churn will further
degrade service quality for clients that are further away 
from the source. 


5 Contracts 


We now describe Contracts, a new scheme to provide 
contribution incentives in P2P live streaming systems. 
Our scheme is based on two key design choices that are 
motivated by the considerations unique to live streaming. 


Contracts rather than bilateral reciprocation: Rec- 
ognizing that bilateral exchanges are ineffective for live 
streaming, we develop a scheme that rewards each peer 
according to its global effectiveness. We borrow from 
economic theory, in particular the principal-agent prob- 
lem, the idea of contracts—a method of structuring in- 
centives in asymmetric or non-bilateral settings [15]. In 
Contracts, a data provider grants a level of service pro- 
portional to the consumer’s ability to replicate the data 
further, as opposed to basing service simply on recipro- 
cal contributions. The contract is thus designed to moti- 
vate consumers to contribute their bandwidth and also to 
hold them accountable for their respective servicing de- 
cisions. Further, since a provider is also a consumer in a 
P2P setting, it should be in the provider’s self-interest to 
enforce the contract to obtain good service from its own 
providers. Put simply, a node’s incentives as a provider 
should be aligned with its incentives as a consumer. 
Global topology optimization: Instead of operating 
with an unstructured mesh, Contracts structures the over- 
lay topology globally to account for the heterogeneity 
of peer capacities. Specifically, we introduce mecha- 
nisms that allow clients to identify their positions rela- 
tive to the stream source and reorganize themselves, with 
high capacity peers percolating towards the source. Peers 
with disparate capacities develop asymmetric yet mutu- 
ally beneficial relationships: low capacity peers benefit 
from the replication capabilities of high capacity peers, 
while high capacity peers are rewarded with better qual- 
ity of service for their contributions. 

In this section, we outline the following: (1) the sin- 
gle global contract for evaluating both the quantity and 
quality of each peer’s contributions, (2) a default policy 
for updating the overlay topology, and (3) a wire-level 
accounting protocol for verifying contributions. 


5.1 Contribution contracts 


In Contracts, each peer is evaluated for both 1) the 
amount that it contributes to directly connected peers, 
and 2) the amounts those peers contribute in turn; i.e.,
the quality of its selections. Peers with higher valua- 
tions will have a greater likelihood of being added to the 
peer lists of high capacity nodes and also enjoy prompt 
service when requesting individual blocks from those 
nodes. We next describe the details of how peer eval- 
uations are computed. 

Performance metrics: For two peers x and y, we denote
the contribution rate from x to y by B(x → y), and
compute this using a weighted moving average. B(x) 
represents the total bandwidth contributed by node x. 
Each of these values is mapped to discrete classes of 
contribution: the deciles of the observed capacity
distribution from PPLive. We label the bandwidth class of a
node x as BC(x). 
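
As a rough sketch of these two metrics, the code below maintains a weighted moving average of an observed rate and maps a rate to a decile class. The smoothing weight `alpha` and the boundaries in `DECILE_EDGES` are hypothetical; in the deployed system the boundaries would come from the measured PPLive capacity distribution.

```python
from bisect import bisect_right


def update_rate(previous_rate, observed_rate, alpha=0.2):
    """Weighted moving average of B(x -> y); alpha is an assumed weight."""
    return (1 - alpha) * previous_rate + alpha * observed_rate


def bandwidth_class(rate, decile_edges):
    """Map a rate to a class 0..9 given nine ascending decile boundaries."""
    return bisect_right(decile_edges, rate)


# Hypothetical decile boundaries; real values would come from
# the measured PPLive capacity distribution.
DECILE_EDGES = [10, 20, 30, 45, 60, 80, 100, 150, 300]

rate = 0.0
for sample in [50, 50, 50]:
    rate = update_rate(rate, sample)
print(rate, bandwidth_class(rate, DECILE_EDGES))
```

Discretizing into classes means that small changes in a peer’s measured rate usually leave BC(x) unchanged.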
Figure 5: Evaluating I for client 1. Contribution from
1 → 2 is weighted by the rates from 2 → 3, 4, 5. Contribution
of 1 → 3 is weighted by the rates from 3 → 2, 6.


To measure the effectiveness of contributions made by 
a client, we define I(x) to be the one-hop propagation of
x’s contributions, calculating this as follows: 


$$I(x) = \sum_{p \in \mathrm{peers}(x)} B(x \to p) \times D_{BW}(BC(p)) \qquad (1)$$


where D_BW ∈ [0,1] is a weight specified by the cumulative
distribution function of peer upload capacities.

As an example computation of I, consider Figure 5. In
this case, the effectiveness of contributions from node
1 is being evaluated. The total contribution rates of
peers 2 and 3 are 120 and 40, respectively. Mapping
the values 120 and 40 to their bandwidth classes and
looking them up in our measured capacity distribution
yields D_BW(BC(3)) = 0.1 and D_BW(BC(2)) = 0.8. Substituting
these values (node 1 contributes at rate 30 to each
peer) allows us to compute I(1) = 30 × 0.1 +
30 × 0.8 = 27.
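
The same computation can be written directly from Equation (1). The sketch below is illustrative: the function name and dictionary layout are our own, and the per-peer rates of 30 are read off the worked computation above.

```python
def effectiveness(contributions, class_weight):
    """I(x): sum over peers p of B(x -> p) * D_BW(BC(p)).

    contributions: dict peer -> B(x -> p).
    class_weight: dict peer -> D_BW(BC(p)), a value in [0, 1].
    """
    return sum(rate * class_weight[p] for p, rate in contributions.items())


# Values from the Figure 5 example for node 1.
contributions = {2: 30, 3: 30}
class_weight = {2: 0.8, 3: 0.1}
print(effectiveness(contributions, class_weight))
```

Shifting node 1’s uploads toward peer 2, the higher capacity peer, would raise I(1), which is the behavior the contract is meant to reward.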

Taken together, measures of contribution (BC) and effectiveness
(I) constitute the global evaluation function,
V(x), which we define as the tuple [BC(x), I(x)] with the
following comparison operator:


$$V(x) > V(y) \iff BC(x) > BC(y) \ \text{or}\ \big(BC(x) = BC(y) \wedge I(x) > I(y)\big).$$


In other words, peers are compared by their bandwidth 
class first, and peers within a class are compared accord- 
ing to the effectiveness of their contributions. 
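
A minimal sketch of this comparison, assuming bandwidth classes are integers and effectiveness values are numbers (the function name is hypothetical):

```python
def evaluates_higher(bc_x, i_x, bc_y, i_y):
    """True when V(x) > V(y): class first, effectiveness breaks ties."""
    return bc_x > bc_y or (bc_x == bc_y and i_x > i_y)


print(evaluates_higher(5, 1.0, 4, 99.0))   # higher class wins outright
print(evaluates_higher(5, 1.0, 5, 2.0))    # same class, lower I loses
```

This is ordinary lexicographic ordering, so in Python it coincides with comparing the tuples `(bc_x, i_x) > (bc_y, i_y)`.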

Servicing policy: The metrics defined above are used 
by clients to identify which peers are selected to receive 
service, the priority of that service, and which potential 
peers to prefer. We distinguish between connection and 
selection in our discussion of service policy. Connection 
is a prerequisite for being selected to receive service, and 
only connected peers exchange the control traffic necessary
to compute V(·).


• Peer selection: Each node periodically rank-orders its
peers by their corresponding V(·) values, selects the
top k of these, where k is a configurable parameter,
and notifies each that block requests will be accepted.

• Block request servicing: Among peers with outstanding
requests, each client prioritizes the request from
the peer with the maximum B(·) value.
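
The two servicing rules above can be sketched as follows. The `Peer` record and the function names are hypothetical, and the metric values are invented for illustration; only the ranking rules come from the text.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    bc: int          # bandwidth class BC(p)
    i: float         # effectiveness I(p)
    b: float         # total contributed bandwidth B(p)


def select_peers(peers, k):
    """Return the top-k peers ranked by V = (BC, I), highest first."""
    return sorted(peers, key=lambda p: (p.bc, p.i), reverse=True)[:k]


def next_request(requesters):
    """Among peers with outstanding requests, serve the one with max B."""
    return max(requesters, key=lambda p: p.b)


peers = [
    Peer("a", bc=3, i=10.0, b=40.0),
    Peer("b", bc=5, i=2.0, b=90.0),
    Peer("c", bc=5, i=7.0, b=85.0),
]
selected = select_peers(peers, k=2)
print([p.name for p in selected])      # peers allowed to request blocks
print(next_request(selected).name)     # request served first
```

Note that selection and request priority use different keys: peer "c" ranks first for selection because of its higher I, while peer "b" is served first because of its higher B.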


In Section 5.3, we describe how each client reliably 
ascertains the performance metrics of its peers. Before 
doing so, we first analyze the incentive structure arising 
from this servicing policy. 

What are the incentives provided by the system? Con- 
tracts provides strong contribution incentives by linking 
quality of service to effective contribution. A peer in- 
creases its chances of being selected for service and its 
priority by increasing its upload contribution. It might 
appear that contribution incentives are weakened by the 
use of bandwidth classes, as a peer p can lower its contri- 
bution while still remaining in its class. However, doing 
so reduces its service priority for block requests, which is 
determined by B(p) among peers in its bandwidth class. 

Contracts also rewards peers for making globally ben- 
eficial contributions. A peer that transmits blocks to 
higher capacity peers will achieve a higher evaluation 
under I(·) (and hence V(·)), increasing the likelihood of
being selected by others. 


Will a provider enforce contracts? One possible devia- 
tion is for the provider p to ignore the rank ordering of 
V(·) values when choosing peers. In this case, a deviating
provider selects a node y rather than x even though
V(x) > V(y). This ordering implies either BC(x) >
BC(y) or BC(x) = BC(y) ∧ I(x) > I(y). In the former
case, the provider’s deviation lowers its own I(·) value
since its contributions to y will be weighted less than
its contributions to x. Hence, the deviation is not in its
self-interest. In the latter case, I(p) is unchanged by
enforcing the contract, and hence p does not benefit from
deviating.

Another possible deviation is for the provider to not 
provide prioritized service to higher capacity peers. For 
instance, a provider could transmit a block to y instead 
of x even though B(x) > B(y). Again, this is not in
the provider’s self-interest since it reduces the provider’s
evaluation under I(·).


Why not other incentive structures? Initially, our definition
of V(·) may seem somewhat arbitrary. Why not
make effectiveness (I) fully recursive, that is, include
the contributions of peers, peers of peers, and so
on? And why use bandwidth classes rather than bandwidth
itself? Or why not use bandwidth only and ignore
effectiveness? We tackle each of these questions in turn 
to provide additional intuition behind the development of 
our global contract. 


We avoid a fully recursive definition of effectiveness 
for scalability. Accounting for the propagation of contri- 
Figure 6: Under an alternate evaluation function
V’(·), E has an incentive to unilaterally deviate when
B(x2) > B(x1) despite V’(x1) < V’(x2).


butions globally creates significant overhead, both com- 
putationally and from increased control traffic. To reduce 
overhead, we limit propagation to one hop, with nodes 
percolating to their globally appropriate position through 
repeated cycles of evaluation and topology updates. 

Unfortunately, limiting the propagation of account- 
ing information creates an incentive to ignore the ef- 
fectiveness of a peer’s contributions, a crucial consid- 
eration when structuring the topology. As an example, 
consider a simpler evaluation function that uses bandwidth
directly rather than bandwidth classes:

$$V'(x) = \sum_{p \in \mathrm{peers}(x)} B(x \to p) \times D_{BW}(B(p)).$$

Under this evaluation function, a strategic client has an
incentive to deviate. Consider the topology shown in Figure 6. In
this case, the system would benefit from E contributing
to x1 since y1 has much higher capacity than y2, thus
V’(x1) > V’(x2). But, a client E* evaluating E considers
only the effectiveness of E, which is determined by the
bandwidths of its peers only. Thus, when B(x2) > B(x1),
a rational E would contribute to x2 rather than x1, despite
the greater redistribution capacity of x1. This problem
arises for any evaluation function with a limited view.

Bandwidth classes mitigate this problem by making 
peers with similar capacities incomparable at evaluators. 
When using V rather than V’ to evaluate peers in Figure 6,
E* treats contributions to x1 and x2 equally (if they
are in the same bandwidth class), eliminating the incen- 
tive for E to deviate. Our assumption is that clients are 
rational, but not Byzantine. Since deviating does not of- 
fer a local benefit, clients will split ties using contribution 
effectiveness, improving overall efficiency. We consider 
colluding and Byzantine peers in Section 5.4. 
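
The difference between the two evaluation functions can be seen numerically. In the sketch below, the capacity sample, the class boundaries, and the choice to weight a class by the CDF value at its upper boundary are all hypothetical; the point is only that raw bandwidth separates x1 from x2 while classes do not.

```python
from bisect import bisect_right

# Hypothetical sample of peer upload capacities used to build D_BW.
CAPACITIES = sorted([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
CLASS_EDGES = [25, 50, 75]  # hypothetical class boundaries


def d_bw(value):
    """Empirical CDF: fraction of sampled capacities <= value."""
    return bisect_right(CAPACITIES, value) / len(CAPACITIES)


def class_of(bandwidth):
    return bisect_right(CLASS_EDGES, bandwidth)


def class_weight(bandwidth):
    """Weight shared by every peer in the same bandwidth class."""
    cls = class_of(bandwidth)
    top = CLASS_EDGES[cls] if cls < len(CLASS_EDGES) else CAPACITIES[-1]
    return d_bw(top)


b_x1, b_x2 = 55, 70   # same class, but B(x2) > B(x1)
rate = 30             # E can upload this much to either peer

# Under V', the credit depends on raw bandwidth, so x2 looks better.
print(rate * d_bw(b_x1), rate * d_bw(b_x2))
# Under V, both peers share a class weight, so E gains nothing by deviating.
print(rate * class_weight(b_x1), rate * class_weight(b_x2))
```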

Finally, we incorporate effectiveness into our default 
contract rather than bandwidth alone in order to provide 
incentive-aware gossip, the topic we describe next. 


5.2 Topology updating policy 


Incentive-aware gossip: In addition to specifying how 
clients should make local servicing decisions, the con- 
tract also influences how the overlay topology is to be up- 
dated globally. To achieve a distribution structure where 
high capacity peers are closer to the source, Contracts 
uses peer gossip informed by V(·) as well as structural
information provided by a hop count field in block messages.
By maintaining the average hop count of blocks
received from peers, each client can compare the average 
distance of its peers to the source, and we use this infor- 
mation to speed convergence when evaluating potential 
peers for connection. 

Each client is aware of the capacities of its one-hop 
neighborhood of peers, and each client attempts to con- 
nect to the peers in this set with highest capacities. Uni- 
versally applied, this results in the highest capacity peers 
percolating to the source, and lower capacity nodes being 
pushed to the periphery of the mesh through attrition. 

Specifically, each client c sorts p ∈ (Peers ∘ Peers)(c),
i.e., the peers of its peers,
by BC(p), connecting to new peers in descending order.
To break ties within a bandwidth class, c orders each potential
peer by its average block hop count, i.e., a measure
of the distance to the source, preferring the most
distant of these. The intuition behind this policy is that 
the most distant peers within a bandwidth class are likely 
to be poorly clustered with respect to capacity and thus 
more likely to have outstanding block requests.³ Preferential
connection with misplaced peers in a bandwidth
class speeds convergence of the topology. On the re- 
ceiver’s side, clients accept incoming connections opti- 
mistically, pruning those that have neither provided data 
nor warranted service in the recent past. Recall that con- 
nectivity does not imply that a peer will be selected for 
service. Exploring new nodes serves to expand the set of 
peers for which a given client can compute V(-). 
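
The connection ordering can be sketched as a single sort. The tuple layout and function name are hypothetical; the ordering rule (bandwidth class descending, then most distant first within a class) follows the policy described above.

```python
def connection_order(candidates):
    """Order potential peers: highest bandwidth class first; within a
    class, prefer the peer farthest from the source (largest hop count).

    candidates: list of (name, bandwidth_class, avg_hop_count) tuples drawn
    from the peers of the client's peers.
    """
    return sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)


candidates = [
    ("p1", 7, 2.1),
    ("p2", 9, 1.4),
    ("p3", 9, 3.8),   # same class as p2 but farther from the source
    ("p4", 4, 5.0),
]
print([name for name, _, _ in connection_order(candidates)])
```

Here p3 is tried before p2 even though both are in the top class, since p3’s larger hop count suggests it is misplaced relative to its capacity.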

This gossip strategy is incentive-aware; it incorporates
the interests of clients seeking to maximize their V-value
by contributing to the highest capacity peers. Consider
the example topology in Figure 6. To compute V(·) for
each of its peers, node E knows the bandwidth capacities
of all labeled nodes (provided by our accounting mechanism).
Suppose BC(y1) > BC(x1). In this case, E would
increase its V-value by sending to y1 directly rather than
through x1, and so connects to y1. Although x1 might
prefer to avoid this, revealing the bandwidth capacity of
y1 to E is required to demonstrate the effectiveness of its
own contributions.


Bootstrapping new clients: For a newly joined, high 
capacity client to demonstrate its capability, it needs to 
receive stream data early enough to replicate that data 
widely. But, since a recently added client is typically 
placed far from the data source, the client might be over- 
looked simply because it could not receive enough useful 
data to replicate. 

To address this problem, Contracts clients advertise an 
optional bootstrapping block comprised of random data. 
Advertising the bootstrapping block serves to inform di- 
rectly connected peers that a client has excess capacity 


3 Preferring distant peers is a heuristic to increase trading opportunities.
Recall that in a mesh with a capacity surplus, clients must compete
to satisfy requests. 
that can be verified through direct transfer. Each transferred
bootstrapping block is worth half that of a normal
block in terms of contribution value, although this value 
need not be precise. In light of the significantly skewed 
bandwidth distribution, our goal is to encourage mean- 
ingful contribution whenever possible while still allow- 
ing high capacity peers to demonstrate their capabilities. 

Contracts’s use of bootstrapping blocks exploits a key 
characteristic of the P2P environment, bandwidth asym- 
metry, to make a tradeoff beneficial to the overlay. Be- 
cause many home broadband connections have signifi- 
cantly greater download capacity than upload capacity, 
identifying high capacity clients by downloading ran- 
dom data trades a reduction in abundant download ca- 
pacity for an increase in upload capacity, the scarce re- 
source. Of course, if a live broadcast has significantly 
more capacity than demand, bootstrapping blocks need 
not be transferred. To limit overhead, a Contracts client 
requests bootstrapping blocks only when it has excess capacity
and with a probability that decreases as the number
of its peers with excess capacity increases. This probability
is given by:

$$1 - \frac{\text{peers with excess capacity}}{\text{total peers}}$$
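
A minimal sketch of this rule, assuming the request probability is one minus the fraction of a client’s peers that have excess capacity. The function names are hypothetical, and returning zero when a client has no peers is our own convention rather than something the text specifies.

```python
import random


def bootstrap_request_probability(peers_with_excess, total_peers):
    """1 - (peers with excess capacity / total peers); 0 with no peers."""
    if total_peers == 0:
        return 0.0
    return 1.0 - peers_with_excess / total_peers


def should_request_bootstrap(has_excess_capacity, peers_with_excess,
                             total_peers, rng=random.random):
    """Request a bootstrapping block only with excess capacity, and then
    only with the probability above."""
    if not has_excess_capacity:
        return False
    p = bootstrap_request_probability(peers_with_excess, total_peers)
    return rng() < p


print(bootstrap_request_probability(2, 8))
```

When every peer already advertises excess capacity the probability falls to zero, so no download bandwidth is spent on random data.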


5.3 Verifying contributions


The preceding topology update policies depend on the 
peers and the tracker obtaining truthful values of the 
global contributions of the peers (e.g., the calculation of 
Equation (1)). Contracts introduces verifiable contribu- 
tions to support this task. 

Each client P using Contracts completes a one-time 
registration step with the streaming infrastructure during 
which it is provided with a unique public/private key pair 
K_P. It should be clear from context whether K_P represents
the public key or the private key. The key pair
serves as the client’s identity and is persistent. After- 
wards, clients are provided with two additional pieces 
of information when connecting to a channel: 1) a peer- 
specific nonce value and 2) the public key of the tracker 
of the channel. The nonce is used by several Con- 
tracts protocol messages to prevent replay attacks, and 
the tracker’s public key is used to authenticate messages 
from the tracker that are forwarded in the overlay. We 
use (M)_K to denote a message M signed by key K.

When Contracts clients connect to one another, they 
exchange their respective public keys and nonce values.⁴
Afterwards, data is exchanged normally. Periodically 
during data transfer, each Contracts client mints a signed 
receipt message for each of its peers. Each receipt ac- 
counts for the most recent contributions of that client and 
is sent to the remote endpoint. For example, if a client 
P with key pair K_P has received V blocks of data from a


4 Man-in-the-middle attacks can be precluded by bundling peer keys
with the peer list returned by the tracker. This increases overhead and 
is optional. 
peer Q since it last sent a receipt to Q, P sends Q a receipt
containing (N_Q, K_P → K_Q : V)_{K_P}. This includes a nonce
(N_Q), the sender and receiver identities, and the number
of blocks, signed by P.⁵ Receipts are sent when a threshold
on the volume of sent data is reached. This threshold
is set by the tracker to control load and overhead. 
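
The structure of a receipt can be sketched as follows. To stay within the standard library, the sketch substitutes HMAC-SHA256 for the public-key signature that Contracts uses; unlike a real signature, an HMAC can only be checked by a party holding the same secret. The field encoding, key value, and function names are hypothetical.

```python
import hashlib
import hmac


def mint_receipt(nonce, sender_id, receiver_id, blocks, signing_key):
    """Build a receipt (N, K_P -> K_Q : V) and a tag over its contents.

    HMAC-SHA256 is a stand-in for the public-key signature so the sketch
    needs only the standard library; the field encoding is hypothetical.
    """
    body = f"{nonce}|{sender_id}->{receiver_id}|{blocks}".encode()
    tag = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return body, tag


def verify_receipt(body, tag, signing_key):
    expected = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


key_p = b"hypothetical key for P"
body, tag = mint_receipt(nonce=5, sender_id="KP", receiver_id="KQ",
                         blocks=16, signing_key=key_p)
print(verify_receipt(body, tag, key_p))
```

Because the nonce is part of the signed body, a receipt replayed with a stale nonce can be recognized even though its tag is valid.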

Receipts serve as the foundation for verified contribu- 
tions in Contracts, and we describe both distributed and 
centralized methods for using them to evolve the over- 
lay topology. Distributed verification reduces load on the 
tracker, increasing scalability. Centralized verification 
speeds topology updates and reduces total network overhead,
while also precluding several attacks from Byzantine
users. These methods are not mutually exclusive; either
(or both) can be used during a broadcast, and clients
can switch between them freely depending on the level 
of contention for tracker resources. We describe each in 
turn. 


Distributed verification: When using distributed verifi- 
cation, the tracker bootstraps new peers by providing a 
random subset of candidates to each client, and times- 
tamps are used as nonce values. Each client forwards 
all receipts it receives due to contributions to its directly 
connected peers, including receipts collected from its 
one-hop neighborhood. Unlike the tracker, which gen- 
erates the keys for all valid users in the broadcast, ordi- 
nary peers that receive receipts cannot distinguish valid 
identities from those generated by a strategic user. If any 
receipt is accepted, such users can manufacture an arbi- 
trary number of receipts and claim any level of contri- 
bution. Thus, the challenge for verification in the dis- 
tributed case is identifying valid receipts. 

To do this, the tracker issues each user a small valid 
user message when the user first joins a given broad- 
cast. This message is signed by the tracker and includes 
a timestamp, channel identifier, and the public key of the 
recipient. When peers connect to one another, they ex- 
change and verify valid user messages, demonstrating to 
one another that they are a valid peer for the given broad- 
cast. 


Centralized verification: Although conceptually 
straightforward, distributed verification and contract 
enforcement assumes rational clients. A large number of 
Byzantine clients may undercut the convergence of our 
topology structuring algorithm, degrading performance. 
To address this, Contracts supports evaluating peers 
and enforcing topology updates at the tracker if neces- 
sary. Centralized topology updates also enable rapid 
and/or fine-grained adjustments to the topology during 
challenging workloads, e.g., flash crowds. The primary 
challenge to centralizing these functions is ensuring 
that the tracker is not overwhelmed with network traffic 


5 For brevity, we omit the broadcast identifier, which is also included.
Computing the digest of receipts at peer Q
1: Digest ← ∅
2: Hash ← 0
3: Sort receipts by sender key
4: for each receipt R: {N, K_P → K_Q : V}
5:     from client P to Q with block count V; do
6:     Increment counter for K_P in Digest
7:     Hash ← SHA-1(Hash ∘ SHA-1(R))
8: done
9: Send {Digest, Hash} to coordinator


Figure 7: Construction of the receipt digest message at
a client Q. The ∘ operator indicates concatenation.


or computational demands, and the remainder of this 
section describes the mechanisms Contracts uses to 
achieve this. 

Periodically, each client contacts the tracker to report 
its continuing participation in the broadcast and requests 
an updated set of peers. In the current PPLive implemen- 
tation, this message also includes the client’s maximum 
upload rate as measured by the client. Contracts piggy- 
backs on this message, replacing the self-reported upload 
rate with a verifiable accounting of blocks contributed to 
specific peers during the previous update interval. Since 
public keys (and hence individual receipts) are lengthy, 
the naive approach of simply forwarding all receipts to 
the tracker would amount to a de-facto DDoS attack. In- 
stead, Contracts clients report a compact, plain-text di- 
gest of receipts. 

The algorithm for constructing the receipt digest mes- 
sage is given in Figure 7. The key underlying technique 
is to trade optional computation at the tracker for a sub- 
stantial reduction in network traffic. Instead of trans- 
mitting full receipts, each digest contains claims about 
receipts received and a verification hash. Claims are a 
plain-text list of contributions that allows the tracker to 
reconstruct the original contribution receipts by recom- 
puting them. A digest contains claims for each receipt re- 
ceived since the last digest was sent (line 4). Each claim 
contains the first n bits of the public key of the receiver 
specified in the full receipt (line 6). Each truncated key 
serves as an index, allowing the tracker to map an iden- 
tifier to a public/private key pair it previously generated 
for a particular user. Finally, a hash chain is computed 
over the original receipts (line 7) sorted by receiver iden- 
tifier (line 3). This can be used by the tracker to verify 
that claims correspond to valid receipts. 
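
A runnable sketch of the digest construction in Figure 7 is given below. Several details are assumptions of the sketch, not taken from the paper: receipts are modeled as tuples with hex-string keys, the byte encoding of a receipt is invented, the initial hash is 20 zero bytes, and keys are truncated to `key_prefix_len` hex characters (four bits each) instead of an exact number of bits.

```python
import hashlib
from collections import Counter


def receipt_digest(receipts, key_prefix_len=4):
    """Build {Digest, Hash} from receipts, following Figure 7.

    receipts: list of (nonce, sender_key, receiver_key, blocks) tuples with
    keys as hex strings. The byte encoding of a receipt and the all-zero
    initial hash are assumptions of this sketch.
    """
    digest = Counter()
    chain = bytes(20)  # "Hash <- 0", as a 20-byte SHA-1-sized value
    for nonce, sender, receiver, blocks in sorted(receipts,
                                                  key=lambda r: r[1]):
        # Claims carry only a truncated key to keep the message compact.
        digest[sender[:key_prefix_len]] += 1
        raw = f"{nonce}|{sender}->{receiver}|{blocks}".encode()
        chain = hashlib.sha1(chain + hashlib.sha1(raw).digest()).digest()
    return dict(digest), chain.hex()


receipts = [
    (5, "aa11ff", "bb22", 16),
    (6, "aa11ff", "bb22", 16),
    (5, "cc33ee", "bb22", 16),
]
claims, chain_hash = receipt_digest(receipts)
print(claims, chain_hash)
```

Sorting before hashing gives the tracker a canonical order in which to regenerate receipts, so it can recompute the same chain from the plain-text claims alone.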

A list of claims informs the tracker as to which re- 
ceivers generated receipts, but to recompute those orig- 
inal receipts and verify the hash chain, the tracker also 
needs to know the number of blocks received and the re- 
ceipt nonce. Both of these are set by the tracker when 
clients initially connect. The block threshold for dis- 
patching receipts, V, is set to control overhead both at 
the tracker and among clients. Each client’s nonce is 


selected at random by the tracker and incremented by 
clients per-peer for each receipt received. For example, if 
a client’s initial nonce is 5 and it receives 2 receipts from 
peer A and 3 from peer B in a given reporting interval, 
subsequent receipts minted by A and B to this client will 
be stamped with nonce values of 7 and 8, respectively. 
The tracker verifies increments to nonce values to pre- 
vent replay attacks, and nonce values are maintained on 
a per-peer basis to prevent concurrent data transfers from 
producing receipts with the identical nonce values. 
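
The per-peer nonce bookkeeping in the example above can be written as a short Python sketch. The function name and the dictionary representation are illustrative assumptions; only the arithmetic comes from the text.

```python
def expected_nonces(initial_nonce, receipts_by_peer):
    """Return the nonce each peer should stamp on its next receipt.

    initial_nonce: the nonce assigned to the client by the tracker.
    receipts_by_peer: mapping of peer id -> receipts received from that peer.
    """
    # Nonces advance independently per peer, one step per receipt.
    return {peer: initial_nonce + count for peer, count in receipts_by_peer.items()}


# The example from the text: initial nonce 5, 2 receipts from A, 3 from B.
print(expected_nonces(5, {"A": 2, "B": 3}))
```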

At the tracker, ranking clients based on the plain- 
text claims in digests requires little overhead relative to 
the existing processing already done by the tracker; ta- 
ble lookups provide the required information to com- 
pute Equation (1) (where the sum of contributed block 
claims per update interval provides contribution rates). 
Although processing is straightforward, verification is
computationally intensive, requiring the tracker to regen- 
erate and hash each signed receipt. But, since only the 
plain-text content of digests is needed to rank clients, the 
tracker can shed load at any time. While sampling di- 
gests may increase susceptibility to cheating, our evalua- 
tion shows that verifying all digests on the fly is feasible 
given PPLive’s current infrastructure provisioning. 


5.4 Collusion resistance 


Contracts includes both centralized and distributed ver- 
ification of receipts to allow the tracker to manage the 
tradeoff between protocol overhead and robustness to 
malicious behavior. In the absence of Byzantine behav- 
ior, distributed verification effectively rewards contribu- 
tion without relying on centralized accounting. With iso- 
lated Byzantine agents, coordinating topology updates at 
the tracker enables convergence even while some nodes 
deviate from our default contract. This increases over- 
head, but as we show in our evaluation, not prohibitively. 
In the remainder of this section, we describe the tech- 
niques used by a tracker to use its global perspective to 
mitigate security attacks, in particular, the well-known 
P2P attack: collusion, in which a group of participants 
work collectively to subvert our accounting mechanism. 
The collusion participants we consider may include both 
real users with interest in receiving stream data, as well 
as synthetic identities created strictly for collusion. 
Limited identity creation: The tracker appeals to stan- 
dard techniques used by other P2P proposals for in- 
hibiting the creation of arbitrarily many synthetic iden- 
tities, the so-called Sybil attack [7]. In particular, the 
tracker limits the creation of new identities on the basis 
of durable identifiers, e.g., cell phone number via SMS. 
Flow integrity check: When a new client joins a broad- 
cast, the tracker evaluates its maximum upload capacity. 
Although a client may choose to upload at a lower rate, it 
cannot exceed the capacity. This restricts potential false 
claims on BC(·). In addition, live streaming imposes a
known incoming rate bound on each client’s long-term 
incoming data rate, which is the streaming rate. When 
verifying receipts, the tracker validates the upload capac- 
ity and incoming rate bounds. Such verification limits the 
collusion of a set of broadcast participants to issue fraud- 
ulent receipts. No group of colluders can form a loop and 
arbitrarily boost a colluder’s contribution value. Specifi- 
cally, consider a client x with the support of a total of K 
colluders. Assume that x is an actual broadcast partici- 
pant that needs to receive actual data from non-colluders. 
Then x cannot issue any fraudulent receipts, as it needs 
to issue receipts for actual data. The capability of the 
colluders to help x is also limited. The value B(x → p),
where p is a colluder, is limited by the streaming rate r
due to the incoming rate bound on p. Thus, with K colluders
generating fraudulent receipts, x can claim at most K · r
in fraudulent contributions to these colluders. But K · r cannot
exceed the upload rate of x measured by the tracker
for WAN traffic. Further, if a given colluder p helps x to
claim contribution rate r, then B(p′ → p) should be zero
for any other client p′; otherwise p violates its incoming
rate bound. Thus, if a collusion scheme is to let B(x → p)
be r for all K colluders, then B(p) has to be zero for all
of the colluders. This substantially limits /(x).
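
The bound in this argument can be stated as a one-line calculation. The sketch below is an illustration under assumed names and units (rates in Kbit/s); it captures only the two limits named in the text, the per-colluder streaming rate and the measured upload capacity of x.

```python
def max_fraudulent_contribution(num_colluders, stream_rate, upload_capacity):
    """Upper bound on the contribution rate x can fraudulently claim.

    Each colluder can absorb at most the stream rate r (incoming rate bound),
    and the total cannot exceed the upload capacity measured by the tracker.
    """
    return min(num_colluders * stream_rate, upload_capacity)
```

For example, with 10 colluders, a 500 Kbit/s stream, and a measured upload capacity of 2,000 Kbit/s, the claim is capped by the upload capacity rather than by K · r.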


Global and diversity weighting: In spite of the preced- 
ing checks, some clients might still be able to collude 
and/or acquire several synthetic identities to increase the 
overall value of V(-) of a client. To address this, the 
tracker detects a cluster of linked colluders. Also, Con- 
tracts can optionally weight the overall value of V(-) by 
the network-level address diversity of the peers to which 
a client contributes. As a consequence of registering for 
a broadcast, the tracker knows each client’s IP address 
and port. For identities within the same IP prefix (/24), 
Contracts dampens the value of contributions when us- 
ing centralized verification. For identities registered at 
the same address (e.g., users behind a NAT), contribu- 
tions are further dampened. This policy restricts collu- 
sion by exploiting the scarcity of IP addresses. 
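
One way to picture diversity weighting is the following Python sketch. The text does not give the damping function, so the rule used here (divide each contribution by the number of receivers sharing its /24 prefix, and again by the number sharing its exact address) is purely an assumption chosen for illustration.

```python
import ipaddress
from collections import Counter


def weighted_contribution(contributions, prefix_len=24):
    """Sum contributed blocks, dampened by receiver address diversity.

    contributions: list of (receiver_ip, blocks) pairs.
    The damping rule below is hypothetical, not the one used by Contracts.
    """
    prefixes = [
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip, _ in contributions
    ]
    prefix_count = Counter(prefixes)
    addr_count = Counter(ip for ip, _ in contributions)

    total = 0.0
    for (ip, blocks), prefix in zip(contributions, prefixes):
        # Same prefix dampens once; same address dampens further.
        total += blocks / (prefix_count[prefix] * addr_count[ip])
    return total
```

Under this rule, contributions spread across distinct /24 prefixes count in full, while contributions concentrated in one prefix or behind one address count for progressively less.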


Note that we do not adopt a universal notion of client 
utility, and we do not claim that Contracts is strategy- 
proof, even given these defenses. An alternative ap- 
proach to mitigating collusion and strategic behavior is to 
restrict each client’s choice in peer selection. As shown 
in previous work [17, 18], limiting peer selection is a 
powerful tool for enabling formal analysis of gossip pro- 
tocols since the potential for protocol deviations is re- 
stricted. But, such restrictions may limit the potential for 
grouping peers based on locality or bandwidth, e.g., high 
bandwidth, local exchange between peers on the same 
LAN. In practice, flexible peering significantly increases 
distribution efficiency in PPLive, leading us to eschew 
restrictions which may aid in formal analysis, leaving 
open these issues for future work.


6 Evaluation 


Our evaluation of Contracts answers two main questions. 
First, is applying Contracts to streaming systems feasible?
We find that it is; Contracts adds modest overhead
but does not fundamentally limit scalability. Second, is 
Contracts effective? To confirm this, we report measure- 
ments of a modified PPLive client to demonstrate the 
performance improvement of Contracts relative to other 
systems and incentive strategies. Specifically: 


• Contracts improves performance relative to unmodified
PPLive. In experiments with heterogeneous capacities
and churn, Contracts increases the number of
clients with uninterrupted playback from 13% to 62%,
an increase of more than 4x.

• Contracts provides robust contribution incentives.
Experiments in bandwidth constrained environments
show that quality of service improves with contribution.
Moreover, Contracts provides a substantial and
consistent improvement in quality of service relative
to tit-for-tat.

• Contracts is scalable, even when using centralized verification.
Using our default parameters, a single Contracts
tracker can support the computational and network
overhead of more than 90,000 concurrent clients.

• Clients are quickly integrated into the mesh. After
only a few rounds of peer exchanges, newly joined
clients percolate to their intended locations in the overlay
with bandwidth clustered peers.


6.1 Performance and incentives 


We first evaluate the performance of our Contracts im- 
plementation, which is built from modifications to the 
reference PPLive client. We show two main results: 
1) PPLive with Contracts significantly outperforms both 
unmodified PPLive and one modified to support tit-for- 
tat (TFT). 2) Contracts provides our intended contri- 
bution incentives; when the system is bandwidth con- 
strained, increasing contribution improves performance. 
Performance: We define performance as the fraction of 
data blocks received by their playback deadlines, and 
compare performance for PPLive, PPLive using Con- 
tracts, PPLive with tit-for-tat, and FlightPath [17]. For 
each of these techniques, we measure the performance of 
100 clients participating in a test broadcast on Emulab
(which allows us to execute Windows binaries).
Clients join the system one at a time, separated by
ten-second intervals. To evaluate Contracts under churn, each
client disconnects and rejoins after participating for 20
minutes. All clients continue this process for two hours.
To compare performance under realistic bandwidth constraints,
client upload capacities are drawn from our measured
capacity distribution of PPLive clients, normalized
to provide an over-provisioning factor of 2; i.e., the
sum of peer capacities is twice the aggregate demand.
Crucially, however, many peers have capacity less than
the stream data rate, a common occurrence in practice.
Both TFT and Contracts clients actively exchange data
with 10 directly connected peers and reevaluate these decisions
every 10 seconds using statistics from the previous
30 seconds. For FlightPath trials, we use the default configuration
parameters described by Li, et al. [17].
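
The performance metric defined in this paragraph, the fraction of data blocks received by their playback deadlines, can be computed as in the following sketch. The representation of a missing block as None and the function name are assumptions made for illustration.

```python
def delivery_rate(arrival_times, deadlines):
    """Fraction of blocks that arrived at or before their playback deadline.

    arrival_times: per-block arrival time in seconds, or None if never received.
    deadlines: per-block playback deadline in seconds (same length).
    """
    if not deadlines:
        raise ValueError("at least one block is required")
    on_time = sum(
        1
        for arrived, deadline in zip(arrival_times, deadlines)
        if arrived is not None and arrived <= deadline
    )
    return on_time / len(deadlines)
```

A client with loss-free playback, in the sense used in the results, is one for which this value is exactly 1.0.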

Figure 8 shows our results. Contracts significantly improves
performance relative to unmodified PPLive and
FlightPath; 62% of Contracts clients experience loss- 
free playback compared with just 13% when using un- 
modified PPLive or 3% when using FlightPath. In other 
words, the fraction of PPLive/Contracts clients experi- 
encing loss-free playback is more than 4 times that of un- 
modified PPLive. For clients that do miss playback dead- 
lines, a larger fraction of blocks arrive in time when using 
Contracts. Relative to unmodified PPLive, tit-for-tat de- 
grades performance for the majority of clients. This is 
consistent with our analysis in Section 3. Tit-for-tat ben- 
efits high capacity clients when they happen to be placed 
near the broadcast source (y > 0.96). But, more distant 
clients cannot collect enough useful data with which to 
trade. Even high capacity clients cannot prove their ca- 
pabilities when far from the source, decreasing overall 
utilization and average performance. 
Incentives: Contracts rewards contribution with in- 
creased robustness. We evaluate this by comparing the 
performance of PPLive using Contracts with that of 
PPLive using tit-for-tat. In both cases, the system is 
bandwidth constrained. We use 100 clients with capacities
uniformly distributed between 1× and 2× the stream rate
(over-provisioning factor 1.5) to connect to a test stream, 
participating in the broadcast for 10 minutes. We repeat 
this experiment 10 times. 

Figure 9 shows the results. Averages are shown with
error bars giving the full range of block delivery rates for
clients with a given capacity. While tit-for-tat does provide
some correlation between contribution and performance,
the amount of improvement varies significantly
because tit-for-tat does not update the topology. In contrast,
Contracts combines both topology updates and local
servicing rate decisions to provide a consistent improvement
in performance, strengthening incentives.

Figure 9: Delivery rate as a function of contribution.


6.2 Overhead 


In this section, we describe implementation details and 
overhead related to verifying contributions, including: 
1) state maintained by the tracker and clients, 2) compu- 
tation required to verify receipts, and 3) network control 
traffic. We discuss each of these in turn. 

State: When using centralized verification, the PPLive 
tracker maintains soft state including bandwidth capac- 
ity, client version, etc. of active clients. To these, Con- 
tracts adds a last digest update field which records the 
timestamp and content of the most recently received re- 
ceipt digest message. This is used to compute contri- 
bution rates when new digests are received, and its size 
varies depending on content. We estimate the likely 
size of receipt digest messages when computing network 
overhead (described below). 

Trackers also maintain hard state: the key pairs of reg- 
istered clients. For cryptographic operations, Contracts 
uses SHA-1 with RSA, DER-encoded PKCS#1 and 1024 
bit keys. Maintaining one million key pairs requires less 
than a gigabyte of storage on disk. A lookup table map- 
ping truncated identifiers to keys easily fits in memory 
on modern servers. For distributed verification, clients 
associate public keys and nonces with connections and 
maintain counters of verified receipts received from each 
directly connected peer. 

Network traffic: Exchanging receipt and receipt digest 
messages is the main source of network overhead in Con- 
tracts. Three related parameters influence this. The 
tracker specifies a digest interval indicating how often 
digests are reported by clients. A lengthy interval re- 
duces the number of such messages at the cost of de- 
layed topology updates or delayed detection of cheating 
clients. The receipt volume specifies how much data each 
receipt acknowledges. Finally, the stream data rate con- 
trols how many receipts are exchanged among peers. To 
make our analysis independent of stream data rate, we 
define receipt volume in terms of how many seconds of
video data each receipt acknowledges. Currently, Contracts
uses a digest update interval of 15 minutes and a
receipt volume that acknowledges 30 seconds of stream
data. In the remainder of this section, we examine the
tradeoffs underlying these choices.

Figure 10: The size of receipt digest messages as a function
of the digest update interval.

For a video stream of moderate quality (500 Kbit),
sending a receipt acknowledging every 30 seconds of 
video data imposes less than 0.1% overhead relative to 
data transfer among peers when verifying contributions 
at the tracker. Distributed verification requires forward- 
ing additional receipts from each peer’s one hop neigh- 
borhood. This increases average network overhead to 
1.2%, trading an increase in traffic among peers for a re- 
duction in traffic at the tracker, which we consider next. 

Network overhead at the tracker is determined by the 
number of receipt digest messages received. Each receipt 
digest message contains a 24 byte header and 6 byte tu- 
ples specifying a peer (4 byte truncated public key) and 
a receipt count (2 bytes). In the worst case, each digest 
would include an entry for every directly connected peer. 
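
The message layout described above translates directly into a size calculation. The Python sketch below assumes network byte order and an opaque header, neither of which is specified in the text; only the 24-byte header and the 6-byte (4 + 2) tuples come from the description.

```python
import struct

HEADER_BYTES = 24                  # fixed header size given in the text
ENTRY = struct.Struct("!4sH")      # 4-byte truncated key + 2-byte receipt count


def digest_size(num_entries):
    """Size in bytes of a receipt digest carrying num_entries peer tuples."""
    return HEADER_BYTES + ENTRY.size * num_entries


def pack_entries(claims):
    """Serialize (truncated_key, receipt_count) pairs into 6-byte tuples."""
    return b"".join(ENTRY.pack(key, count) for key, count in claims)
```

A digest with an entry for each of 10 directly connected peers would therefore occupy 24 + 6 × 10 = 84 bytes.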

In practice, only a fraction of connected peers are in- 
cluded in a single digest update. To compute this, we 
measured the amount of data uploaded to directly con- 
nected peers by an instrumented PPLive client that par- 
ticipated in popular broadcasts for 10 minutes apiece. 
Contribution is highly skewed; for each client, the top 
10% of its peers receive 60% of its contributed data, 
meaning that there are fewer entries in each digest. 

Combining our measurements of skew with the typical 
number of directly connected peers allows us to compute 
the size of receipt digest messages. Figure 10 shows this 
data for several receipt volume values. Each line shows 
the growth in the size of a digest message as a function 
of the update interval. Each data point is averaged over 
10,000 randomly generated digest messages using sam- 
ples from measured distributions to specify the directly 
connected peers and capacity skew. To compute aggre- 
gate traffic at the tracker, we multiply the average receipt 
digest size by the total number of clients. For instance, 
processing digests for 100,000 clients with our default 
parameters requires 10 KBps of tracker overhead. 
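
The aggregate figure can be checked with a short calculation. In the sketch below, the 90-byte average digest size is not stated in the text; it is the value implied by the reported 10 KBps if one KB is taken as 1,000 bytes and each client sends one digest per 15-minute interval.

```python
def tracker_traffic_bytes_per_sec(avg_digest_bytes, num_clients, interval_sec):
    """Average inbound digest traffic at the tracker, in bytes per second."""
    # Every client sends one digest per update interval.
    return avg_digest_bytes * num_clients / interval_sec


# 100,000 clients, 15-minute interval, assumed 90-byte average digest.
print(tracker_traffic_bytes_per_sec(90, 100_000, 15 * 60))
```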
Computation: Computational requirements at the 
clients are dominated by the demands of video playback.
At the tracker, the computational overhead of Contracts 
is dominated by receipt verification. Verification requires 
regenerating the receipt messages specified by receipt di- 
gest messages and computing the SHA-1 hash chain for 
the generated receipts to verify the hash specified in the 
digest (Figure 7). Thus, the computational overhead of 
verification depends on the number of receipts, which is 
determined by the stream data rate and receipt volume. 

The total number of receipts per second generated by 
a channel is simply the ratio of data rate and receipt vol- 
ume multiplied by the population. A micro-benchmark 
on a single commodity server using our current imple- 
mentation can verify 3,200 receipts per second, and re- 
ceipt verification is embarrassingly parallel. If receipts 
encapsulate 30 seconds worth of video data, our cur- 
rent implementation can verify receipts for more than 
90,000 simultaneous clients in real-time using a single 
server. In practice, management of so large a broadcast 
is already distributed across several servers in PPLive, 
meaning that receipt verification with Contracts does not 
dominate resource usage when scaling the coordination 
infrastructure. As with network overhead, Contracts allows
the tracker to shed computational load when required.
Receipt digest messages that are not cryptographically
verified can still be used to evolve the topology and
(optionally) stored for later verification. This increases 
the window of vulnerability to a cheating client but does 
not degrade the efficiency of distribution. 
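
The scalability figure follows from simple arithmetic, sketched below. The calculation assumes that each client receives the stream at its full rate and therefore generates one receipt per receipt volume (30 seconds of video); the function name is illustrative.

```python
def max_verifiable_clients(receipts_per_sec, receipt_volume_sec):
    """Clients one server can verify in real time.

    Each client produces one receipt every receipt_volume_sec seconds, so a
    server verifying receipts_per_sec receipts keeps up with this many clients.
    """
    return receipts_per_sec * receipt_volume_sec


print(max_verifiable_clients(3200, 30))
```

With the measured 3,200 receipts per second and a 30-second receipt volume, this gives 96,000 clients, consistent with the "more than 90,000" reported above.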


6.3 Convergence 


We next consider the integration of new clients into the 
mesh. Convergence of clients to their intended location 
in the topology is determined by many factors. We con- 
sider two explicitly: 1) the capacity of a newly joined 
client, and 2) the number of newly joining clients; e.g., 
integrating a flash crowd may require additional peer ex- 
changes relative to integrating a single client into a sta- 
ble mesh. We measure convergence in terms of update 
rounds; i.e., the interval between peer gossip connec- 
tions. To understand convergence at scale, we use trace- 
driven simulation of Contracts using default parameters. 

We first evaluate convergence as a function of a newly 
joined client’s bandwidth capacity. For each capacity, the 
new client joins a 10,000 user channel with stable mem- 
bership, and we record the number of topology updates 
required for the newly joined client to reach a stable po- 
sition. We consider a client to have reached a stable po- 
sition when the average capacity of its net contributors 
(i.e., those that provide more blocks than they receive) is
within 5% of the average capacity of net consumers. The 
vast majority of peers (> 80%) reach a stable position in 
four update rounds or less. Broadly, the results are consistent
with the variation in observed bandwidth capacity.
Low capacity peers can quickly discover a stable set of
similarly low capacity peers, while high capacity peers
need several rounds to stabilize.

Figure 11: The number of peer exchanges required for a
set of newly joined clients to reach stable matchings as a
function of the number of arrivals in a flash crowd.
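
The stability criterion used in this experiment can be expressed as a small predicate. The sketch below assumes that "within 5%" is measured relative to the average capacity of net consumers and that a client with no contributors or no consumers is not yet stable; both are interpretations, not details from the text.

```python
def is_stable(contributor_capacities, consumer_capacities, tolerance=0.05):
    """Check whether a client has reached a stable position in the topology.

    contributor_capacities: capacities of peers that give it more than they take.
    consumer_capacities: capacities of peers that take more than they give.
    """
    if not contributor_capacities or not consumer_capacities:
        return False  # assumed: no basis for comparison yet
    avg_contrib = sum(contributor_capacities) / len(contributor_capacities)
    avg_consume = sum(consumer_capacities) / len(consumer_capacities)
    return abs(avg_contrib - avg_consume) <= tolerance * avg_consume
```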


Next, we consider topology convergence for flash- 
crowd arrivals. In this case, we simulate a channel with 
1,000 initial participants and vary the number of joining 
clients. Each new client is assigned a capacity drawn 
from the same distribution as the existing clients, provid- 
ing a constant amount of resources in the system. Fig- 
ure 11 shows the number of rounds required to achieve 
stability for the last newly joined client in the crowd. 
The number of rounds required increases logarithmically 
with the number of joining peers. 


6.4 Over-provisioning 


Our evaluation thus far has focused on the performance 
of Contracts in settings with a specific amount of over-provisioning;
i.e., capacity in excess of total demand. We
now evaluate over-provisioning directly, measuring the 
performance of PPLive and Contracts while scaling our 
measured capacity distribution to vary the ratio of capac- 
ity to demand. We measure the block delivery rate of 100 
static PPLive clients running on Emulab. As in previous 
experiments, we record each client’s block delivery rate. 
Figure 12 summarizes the results, with error bars show- 
ing the 5th and 95th percentiles of delivery rate across all 
clients. In each trial, the average delivery rate of PPLive using
Contracts exceeds that of unmodified PPLive. When
capacity is limited, low capacity clients are penalized by 
Contracts, contributing to variations in performance. As 
capacity increases, however, Contracts delivers consis- 
tently higher quality of service for all peers. Taken to- 
gether, these results show that Contracts achieves consis- 
tently higher performance for a range of operating con- 
ditions, and delivers on our overall goals of improving 
efficiency and providing contribution incentives. When 
the system is capacity rich, Contracts improves distri- 
bution efficiency, improving delivery rate for all peers. 
But, during periods of resource contention, high capac- 
ity peers receive better quality of service. 
Figure 12: The impact of over-provisioning on PPLive’s
performance. Data points show the average fraction of 
blocks received by their playback deadlines. 


7 Related work 


Our work builds on a large body of prior work focused 
on live streaming, P2P data distribution, and incentives. 

The notion of a P2P approach to data streaming was 
pioneered by Narada, Overcast and Yoid [4, 14, 8]. These 
designs tried to approximate multicast support using a 
tree structured overlay. SplitStream, a subsequent de- 
sign, addressed the limited utilization of leaf nodes in 
tree-based schemes [2]. 

Subsequent work has applied swarming designs, 
borrowed from BitTorrent-like systems, to video-on- 
demand and live streaming. Coolstreaming/DONet ap- 
plies a mesh-based network structure to live stream- 
ing [28]. Annapureddy, et al. argue that high quality 
video on demand is feasible using a P2P architecture, 
a point reinforced by recent work describing PPLive’s 
video-on-demand P2P implementation [13] as well as 
other publicly available commercial streaming systems 
(e.g., PPStream, SopCast, TVAnts, and UUSee). 

More recent work has studied incentives in bulk data 
distribution in widely deployed systems, particularly Bit- 
Torrent. Qiu and Srikant studied BitTorrent formally, 
finding that it achieves a Nash equilibrium under certain 
conditions [25], although more recent work has shown 
practical mechanisms for subverting BitTorrent’s incen- 
tive strategy [22]. These advancements in understand- 
ing the subtlety of bilateral exchange motivated our con- 
sideration of its application to live streaming. In [1], 
Aperjis, et al. extend bilateral exchange to multilateral
exchange by introducing prices. They compare their 
scheme with BitTorrent and show improvements in ef- 
ficiency and robustness. One hop reputations [23] use 
limited propagation of contribution information to im- 
prove incentives in BitTorrent; we apply similar ideas to 
live streaming. 

Most related to our work are systems that address in- 
centives in live streaming (e.g., [10, 12, 18, 19, 21, 26, 
27]). Sung, et al. describe a live streaming design that 
rewards contribution but depends on honest capacity re- 
porting by peers [27]. SecureStream introduces protocol
mechanisms to defend against several attacks (e.g.,
forged data and denial of service [12]). These techniques 
are largely complementary to our work, which focuses 
on verifiably rewarding contribution. BAR Gossip an- 
alyzes incentives in streaming formally and describes a 
protocol designed to induce contributions from rational 
users [18]. FlightPath relaxes several constraints of BAR 
Gossip (e.g., by allowing dynamic membership), but still 
requires the long-term balance of contribution and con- 
sumption to provide contribution incentives. Our experi- 
ence applying rate-based tit-for-tat is consistent with that 
of Pianese, et al. who apply TFT to live streaming and 
experimentally confirm the need for significant altruism 
to achieve robustness [21]. 


Motivated by the practical challenges of client het- 
erogeneity, we take a different approach in the design 
of Contracts, providing incentives via a global contract 
and including explicit topology restructuring in our algo- 
rithm design. Habib, et al. propose providing high capac- 
ity clients with additional peers to improve their service 
quality [10], but such improvements are not assured in 
environments with high levels of bandwidth heterogene- 
ity. 

Several live-streaming systems focus on providing ro- 
bustness by enforcing contribution amounts. Chu, et al. 
propose mandatory, centrally enforced taxation in the 
context of multi-tree live streaming [3]. Haridasan, et 
al. provide a two-level auditing scheme for live stream- 
ing that ensures that peers contribute more than a thresh- 
old amount of data [11]. Local auditing and gossip pro- 
vide an immediate but partial check on user’s contribu- 
tion, while global audit ensures that a misbehaving node 
is caught. Rather than punishing nodes that do not con- 
tribute a sufficient amount, Contracts rewards nodes for 
voluntarily contributing as much as possible. 


8 Conclusion


We have examined performance and contribution incen- 
tives for live streaming systems. The unique features of 
the P2P live streaming environment limit the effective- 
ness of many widely-used incentive strategies based on 
balanced or bilateral exchange. These challenges moti- 
vate the design of Contracts, a new incentive strategy that 
rewards contribution with quality of service by evolving 
the overlay topology. Building on a protocol that pro- 
vides verifiable contributions, we have shown that the 
use of Contracts both improves performance relative to 
PPLive and strengthens contribution incentives relative 
to existing approaches without curtailing scalability. 
Abstract


CoralCDN is a self-organizing web content distribution 
network (CDN). Publishing through CoralCDN is as sim- 
ple as making a small change to a URL’s hostname; a 
decentralized DNS layer transparently directs browsers to 
nearby participating cache nodes, which in turn cooperate 
to minimize load on the origin webserver. CoralCDN has 
been publicly available on PlanetLab since March 2004, 
accounting for the majority of its bandwidth and serving 
requests for several million users (client IPs) per day. This 
paper describes CoralCDN’s usage scenarios and a num- 
ber of experiences drawn from its multi-year deployment. 
These lessons range from the specific to the general, touch- 
ing on the Web (APIs, naming, and security), CDNs (ro- 
bustness and resource management), and virtualized host- 
ing (visibility and control). We identify design aspects and 
changes that helped CoralCDN succeed, yet also those that 
proved wrong for its current environment. 


1 Introduction 


The goal of CoralCDN was to make desired web content 
available to everybody, regardless of the publisher’s own 
resources or dedicated hosting services. To do so, Coral- 
CDN provides an open, self-organizing web content distri- 
bution network (CDN) that any publisher is free to use, 
without any prior registration, authorization, or special 
configuration. Publishing through CoralCDN is as simple 
as appending a suffix to a URL’s hostname, e.g.,
http://example.com.nyud.net/. This URL modification
may be done by clients, origin servers, or third parties that 
link to these domains. Clients accessing such Coralized 
URLs are transparently directed by CoralCDN’s network 
of DNS servers to nearby participating proxies. These 
proxies, in turn, coordinate to serve content and thus min- 
imize load on origin servers. 
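
The URL rewriting described above can be illustrated with a few lines of Python. This is a minimal sketch using only the standard library; the function name coralize and the decision to preserve an explicit port are assumptions, and only the nyud.net suffix comes from the text.

```python
from urllib.parse import urlsplit, urlunsplit


def coralize(url, suffix="nyud.net"):
    """Append the CoralCDN suffix to the hostname of an HTTP URL."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError("URL has no hostname: %r" % url)
    if host.endswith("." + suffix):
        return url  # already Coralized
    netloc = host + "." + suffix
    if parts.port is not None:
        netloc += ":%d" % parts.port  # assumed: keep any explicit port as-is
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(coralize("http://example.com/"))
```

Because the rewrite touches only the hostname, it can be applied equally by a client, by the origin server when generating links, or by a third-party site linking to the content.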

CoralCDN was designed to automatically and scalably 
handle sudden spikes in traffic for new content [14]. It 
can efficiently discover cached content anywhere in its net- 
work, and it dynamically replicates content in proportion 
to its popularity. Both techniques help minimize origin re- 
quests and satisfy changing traffic demands. 

While originally designed for decentralized and unmanaged
settings, CoralCDN was deployed on the PlanetLab
research network [27] in March 2004, given PlanetLab’s
convenience and availability. CoralCDN has since remained
publicly available for more than five years at hundreds
of PlanetLab sites world-wide. Accounting for a majority
of public PlanetLab traffic and users, CoralCDN typically
serves several terabytes of data per day, in response
to tens of millions of HTTP requests from around two million
users (unique client IP addresses).

Over the course of its deployment, we have come to 
acknowledge several realities. On a positive note, Coral- 
CDN’s notably simple interface led to widespread and in- 
novative uses. Sites began using CoralCDN as an elas- 
tic infrastructure, dynamically redirecting traffic to Coral- 
CDN at times of high resource contention and pulling back 
as traffic levels abated. On the flip side, fundamental parts 
of CoralCDN’s design were ill-suited for its deployment 
and the majority of its use. If one were to consider the var- 
ious reasons for its use—for resurrecting long-unavailable 
sites, supporting random surfing, distributing popular con- 
tent, and mitigating flash crowds—CoralCDN’s design is 
insufficient for the first, unnecessary for the second, and 
overkill for the third, at least given its current deployment. 
But diverse and unanticipated use is unavoidable for an 
open system, yet openness is a necessary design choice for
handling the final flash-crowd scenario. 

This paper provides a retrospective of our experience 
building and operating CoralCDN over the past five years. 
Our purpose is threefold. First, after summarizing Coral- 
CDN’s published design [14] in Section §2, we present 
data collected over the system’s production deployment 
and consider its implications. Second, we discuss various 
deployment challenges we encountered and describe our 
preferred solutions. Some of these changes we have im- 
plemented and incorporated into CoralCDN; others require 
adoption by third-parties. Third, given these insights, we 
revisit the problem of building a secure, open, and scalable 
content distribution network. More specifically, this paper 
addresses the following topics: 


e The success of CoralCDN’s design given observed us- 
age patterns (§3). Our verdict is mixed: A large ma- 
jority of its traffic does not require any cooperative 
caching at all, yet its handling of flash crowds relies 
on such cooperation. 


¢ Web security implications of CoralCDN’s open API 
(§4). Through its open API, sites began leveraging 
CoralCDN as an elastic resource for content distri- 
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bution. Yet this very openness exposed a number of 
web security challenges. Many can be attributed to 
a lack of explicitness for specifying appropriate pro- 
tection domains, and they arise due to violations of 
traditional security principles (such as least privilege, 
complete mediation, and fail-safe defaults [33]). 


e Resource management in CDNs (§5). CoralCDN 
commonly faced the challenge of interacting with 
oversubscribed and ill-behaved resources, both re- 
mote origin servers and its own deployment platform. 
Various aspects of its design react conservatively to 
change and perform admission control for resources. 


• Desired properties for deployment platforms (§6).
Application deployments could benefit from greater 
visibility into and control over lower layers of their 
platforms. Some challenges are again confounded 
when information and policies cannot be expressed 
explicitly between layers. 


• Directions for building large-scale, cooperative
CDNs (§7). While using decentralized algo- 
rithms, CoralCDN currently operates on a centrally- 
administered, smaller-scale testbed of trusted servers. 
We revisit the challenge of escaping this setting. 


Rather than focus on CoralCDN’s self-organizing algo- 
rithms, the majority of this paper analyzes CoralCDN as an 
example of an open web service on a virtualized platform. 
As such, the experiences we detail may have implications
for a wider audience, including those developing distributed
hash tables (DHTs) for key-value storage, CDNs or web 
services for elastic provisioning, virtualized network fa- 
cilities for programmable networks, or cloud computing 
platforms for virtualized hosting. While many of the ob- 
servations we report are neither new nor surprising in hind- 
sight, many relate to mistakes, oversights, or limitations of 
CoralCDN’s original design that only became apparent to 
us from its deployment. 

We next review CoralCDN’s architecture and protocols; 
a more complete description can be found in [14]. All sys- 
tem details presented after §2 were developed subsequent 
to that publication. We discuss related work throughout 
the paper as we touch on different aspects of CoralCDN. 


2 Original CoralCDN Design 


The Coral Content Distribution Network is composed of 
three main parts: (1) a network of cooperative HTTP prox- 
ies that handle client requests from users, (2) a network 
of DNS nameservers for nyud.net that map clients to 
nearby CoralCDN HTTP proxies, and (3) the underlying 
Coral indexing infrastructure and clustering machinery on 
which the first two applications are built. This paper con- 
sistently refers to the system’s indexing layer as Coral, and 
the entire content distribution system as CoralCDN.

Figure 1: The steps involved in serving a Coralized URL.


2.1 System overview 


At a high level, the following steps occur when a client 
issues a request to CoralCDN, as shown in Figure 1. 


1. Resolving DNS. A client resolves a “Coralized”
domain name (e.g., of the form example.com.nyud.net)
using CoralCDN nameservers. A CoralCDN nameserver
probes the client to determine its round-trip-time and
uses this information to determine appropriate
nameservers and proxies to return.


2. Processing HTTP client requests. The client sends 
an HTTP request for a Coralized URL to one of the 
returned proxies. If the proxy is caching the web ob- 
ject locally, it returns the object and the client is fin- 
ished. Otherwise, the proxy attempts to find the ob- 
ject on another CoralCDN proxy. 


3. Discovering cooperative-cached content. The proxy 
looks up the object’s URL in the Coral indexing layer. 


4. Retrieving content. If Coral returns the address of a 
node caching the object, the proxy fetches the object 
from this node. Otherwise, the proxy downloads the 
object from the origin server example.com. 


5. Serving content to clients. The proxy stores the web 
object to disk and returns it to the client browser. 


6. Announcing cached content. The proxy stores a ref-
erence to itself in Coral, recording the fact that it is now
caching the URL.
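
The proxy-side steps above (2 through 6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CoralCDN’s implementation: the `Proxy` class, the dictionary standing in for the Coral indexing layer, the `peers` table standing in for inter-proxy HTTP, and the `fetch_origin` callback are all hypothetical names introduced here, and DNS resolution (step 1) is omitted.

```python
from typing import Callable, Dict, List, Optional


class Proxy:
    """Toy stand-in for a cooperative caching proxy (illustrative only)."""

    def __init__(self, addr: str, index: Dict[str, List[str]],
                 peers: Dict[str, "Proxy"],
                 fetch_origin: Callable[[str], bytes]) -> None:
        self.addr = addr
        self.index = index                # hypothetical stand-in for the index
        self.peers = peers                # address -> proxy (stand-in for HTTP)
        self.fetch_origin = fetch_origin  # hypothetical origin fetcher
        self.cache: Dict[str, bytes] = {}

    def handle(self, url: str) -> bytes:
        # Step 2: serve from the local cache when possible.
        if url in self.cache:
            return self.cache[url]
        # Step 3: look up the URL in the index.
        body: Optional[bytes] = None
        for addr in self.index.get(url, []):
            peer = self.peers.get(addr)
            # Step 4a: fetch from a referred proxy that still holds the object.
            if peer is not None and url in peer.cache:
                body = peer.cache[url]
                break
        # Step 4b: no referral, or no referral returned data: go to the origin.
        if body is None:
            body = self.fetch_origin(url)
        # Step 5: store the object locally before returning it.
        self.cache[url] = body
        # Step 6: announce this proxy as a source for the URL.
        self.index.setdefault(url, []).append(self.addr)
        return body
```

With two such proxies sharing one index, a request to the second proxy for a URL already fetched by the first is served by the first proxy, so the origin is contacted only once.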


This section reviews the design of the Coral indexing layer 
and the CDN’s proxies, as proposed in [14]. 


2.2 Coral indexing layer 


The Coral indexing layer is closely related to the structure 
and organization of distributed hash tables like Chord [34] 
and Kademlia [23], with the latter serving as the basis for 
its underlying algorithm. The system maps opaque keys 
onto nodes by hashing their value onto a flat, semantic-free 
identifier (ID) space; nodes are assigned identifiers in the 
same ID space. It allows scalable key lookup (in O(log n)
overlay hops for n-node systems), reorganizes itself upon 
network membership changes, and provides robust behav-
ior against failure.

Compared to “traditional” DHTs, Coral introduced a
few novel techniques that were well-suited for its partic- 
ular application [13]. Its key-value indexing layer was 
designed with weaker consistency requirements in mind, 
and its lookup structure self-organized into a locality- 
optimized hierarchy of clusters of peers. After all, a client 
need not discover all proxies caching a particular file, it 
only needs to find several such proxies, preferably ones 
nearby. Like most DHTs, Coral exposes put and get oper- 
ations, to announce one’s address as caching a web object, 
and to discover other proxies caching the object associated 
with a particular URL, respectively. Inserted addresses are 
soft-state mappings with a time-to-live (TTL) value. 

Coral’s put and get operations are designed to spread 
load, both within the DHT and across CoralCDN proxies. 
To get the proxy addresses associated with a key k, a node 
traverses the ID space with iterative RPCs, and it stops 
upon finding any remote peer storing values for k. This 
peer need not be the one closest to k (in terms of DHT 
identifier space distance). To put a key/value pair, Coral 
routes to nodes successively closer to k and stops when 
finding either (1) the nodes closest to k or (2) one that is
experiencing high request rates for k and already is caching 
several corresponding values (with longer-lived TTLs). It 
stores the pair at the node closest to k that it managed to 
reach. These processes prevent tree saturation in the DHT. 
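
The stopping rules for put and get can be illustrated with a small sketch. It assumes a precomputed routing path ordered from the node farthest from k to the node closest to k, and it treats a node that is both “full” and “loaded” as refusing the insert, so the pair lands on the last node reached before it. The `Node` class and the two thresholds are hypothetical values chosen for illustration, not parameters taken from the deployed system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_VALUES = 4   # hypothetical "already caching several values" threshold
MAX_RATE = 12    # hypothetical "high request rate" threshold


@dataclass
class Node:
    name: str
    values: Dict[str, List[str]] = field(default_factory=dict)
    rate: Dict[str, int] = field(default_factory=dict)

    def saturated(self, key: str) -> bool:
        """True when the node is both full and loaded for this key."""
        return (len(self.values.get(key, [])) >= MAX_VALUES
                and self.rate.get(key, 0) >= MAX_RATE)


def get(path: List[Node], key: str) -> List[str]:
    """Walk toward the key and stop at the first node storing values."""
    for node in path:
        node.rate[key] = node.rate.get(key, 0) + 1
        if node.values.get(key):
            return list(node.values[key])
    return []


def put(path: List[Node], key: str, value: str) -> Optional[Node]:
    """Store at the closest node reached before a saturated one."""
    target: Optional[Node] = None
    for node in path:
        if node.saturated(key):
            break
        target = node
    if target is not None:
        target.values.setdefault(key, []).append(value)
    return target
```

Once the closest node saturates, later inserts settle one hop earlier on the path, and later lookups stop there as well. That is the load-spreading effect the text attributes to these rules.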

To improve locality, these routing operations are not 
initially performed across the entire global overlay. In- 
stead, each Coral node belongs to several distinct routing 
structures called clusters. Each cluster is characterized by 
a maximum desired network round-trip-time (RTT). The 
system is parameterized by a fixed hierarchy of clusters 
with different expected RTT thresholds. Coral’s deploy-
ment uses a three-level hierarchy, with level-0 denoting the
global cluster and level-2 the most local one. Coral em- 
ploys distributed algorithms to form localized, stable clus- 
ters, which we briefly return to in §5.3. 

Every node belongs to one cluster at each level, as in 
Figure 2. Coral queries nodes in fast clusters before those 
in slower clusters. This both reduces lookup latency and 
increases the chance of returning values stored at nearby 
nodes, which correspond to addresses of nearby proxies. 
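
A sketch of this local-first query order follows. It assumes a caller-supplied `lookup(level, key)` function that performs the get within one cluster; that function and the name `hierarchical_get` are hypothetical and stand in for the per-cluster routing described above.

```python
from typing import Callable, List, Optional, Tuple

# Level 2 is the most local cluster and level 0 the global one.
LEVELS_LOCAL_FIRST = (2, 1, 0)


def hierarchical_get(
    key: str,
    lookup: Callable[[int, str], List[str]],
) -> Tuple[Optional[int], List[str]]:
    """Query faster clusters first and stop at the first level with a hit."""
    for level in LEVELS_LOCAL_FIRST:
        proxies = lookup(level, key)
        if proxies:
            return level, proxies
    return None, []
```

Because the search stops at the first level that returns a mapping, any addresses it returns belong to the nearest cluster that knows the key.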


2.3 The CoralCDN HTTP proxy 


CoralCDN seeks to aggressively minimize load on origin 
servers. This section summarizes how its proxies use Coral 
for inter-proxy cooperation and adaptation to flash crowds. 


2.3.1 Locality-optimized inter-proxy transfers 


Each CoralCDN proxy keeps a local cache from which it 
can immediately fulfill client requests. When a client re- 
quests a non-resident URL, CoralCDN proxies attempt to 
fetch web content from each other, using the Coral index- 
ing layer for discovery. A proxy only contacts a URL’s
origin server after the Coral indexing layer provides no re-
ferrals or none of its referrals return the data.

Figure 2: Coral’s three-level hierarchical overlay structure. A node
first queries others in its level-2 cluster (the dotted rings), where
pointers reference other caching proxies within the same cluster. If a
node finds a mapping in its local cluster (after step 2), its get finishes.
Otherwise, it continues among its level-1 cluster (the solid rings), and
finally, if needed, to any node within the global level-0 system.

CoralCDN’s inter-proxy transfers are optimized for lo- 
cality, both from their use of parallel connections to other 
proxies and by the order in which neighboring proxies are 
contacted. The properties of Coral’s hierarchical index-
ing ensure that the list of proxies returned by get will be
sorted based on their cluster distance to the request initia-
tor. Thus, proxies will attempt to contact level-2 neighbors
before level-1 and level-0 proxies, respectively.


2.3.2 Rapid adaptation to flash crowds 


Unlike many web proxies, CoralCDN is explicitly de- 
signed for flash-crowd scenarios. If a flash crowd suddenly 
arrives for a web object, proxies self-organize into a form 
of multicast tree for retrieving the object. Data streams 
from the proxies that started to fetch the object from the 
origin server to those arriving later. This limits concurrent 
object requests to the origin server upon a flash crowd. 
CoralCDN provides such behavior by cut-through rout-
ing and optimistic references. First, CoralCDN’s use of
cut-through routing at each proxy helps reduce transmis-
sion time for larger files. That is, a proxy will upload por-
tions of an object as soon as they are downloaded, not wait-
ing until it receives the entire object. Second, proxies opti-
mistically announce themselves as sources of content. As
soon as a CoralCDN proxy begins receiving the first bytes 
of a web object—either from the origin or another proxy— 
it inserts a reference to itself into Coral with a short TTL 
(30 seconds). It continually renews this short-lived refer-
ence until either it completes the download (at which time
it inserts a longer-lived reference¹) or the download fails.

¹ The deployed system uses 2-hour TTLs for successful results (status
codes of 200, 301, 302, etc.), and 15-minute TTLs for 403, 404, and other
unsuccessful, non-transient results.

Figure 3: Total HTTP requests per day during CoralCDN’s deploy-
ment. Grayed regions correspond to missing or incomplete data.

Figure 4: CoralCDN usage: number of unique clients (left) and
upload volume (right) for each day during August 9-18.
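
The TTL policy for optimistic references in §2.3.2 can be summarized as a small function. The three TTL values come from the text and its footnote; the mapping of whole status-code classes is an assumption made here to fill in the footnote’s “etc.”, and treating 5xx responses as transient (no longer-lived reference) is likewise an assumption. The name `reference_ttl` is hypothetical.

```python
from typing import Optional

IN_PROGRESS_TTL = 30        # seconds; renewed while the download runs
SUCCESS_TTL = 2 * 60 * 60   # 2 hours for successful results
FAILURE_TTL = 15 * 60       # 15 minutes for non-transient failures


def reference_ttl(status: Optional[int]) -> Optional[int]:
    """Return the TTL in seconds for a proxy's reference to itself.

    `status` is None while the download is still in progress.
    """
    if status is None:
        return IN_PROGRESS_TTL
    if 200 <= status < 400:      # assumed reading of "200, 301, 302, etc."
        return SUCCESS_TTL
    if 400 <= status < 500:      # assumed reading of "403, 404, and other"
        return FAILURE_TTL
    return None                  # assumed transient: no longer-lived reference
```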


2.4 Implementation and deployment 


CoralCDN is composed of three stand-alone applications. 
The Coral daemon provides the distributed indexing layer, 
accessed over UNIX domain sockets from a simple client 
library linked into applications such as CoralCDN’s HTTP 
proxy and DNS server. All three are written from scratch. 
Coral network communication uses Sun RPC over UDP, 
while CoralCDN proxies transfer content via standard 
HTTP connections. At initial publication [14], the Coral 
daemon was about 14,000 lines of C++, the DNS server 
2,000 LOC, and the proxy 4,000 LOC. CoralCDN’s im- 
plementation has since grown to around 50,000 LOC. The 
changes we later discuss help account for this increase. 
CoralCDN typically runs on 300-400 PlanetLab servers 
(about 70-100 of which run its DNS server), spread over 
100-200 sites worldwide. It avoids Internet2-only and 
commercial sites, the latter due to policy decisions that re- 
strict their use for open services. CoralCDN uses no spe- 
cial knowledge of these machines’ locations or connectiv- 
ity (e.g., GPS coordinates, routing information, etc.). Even 
though CoralCDN runs on a centrally-managed testbed, 
its mechanisms remain decentralized and self-organizing. 
The only use of centralization is for managing software 
and configuration updates and for controlling run status. 


3 Analyzing CoralCDN’s Usage 


This section presents some HTTP-level data from Coral- 
CDN’s deployment and considers its implications. 


3.1 System traces and traffic patterns 


To understand some of the HTTP traffic patterns that 
CoralCDN sees, we analyzed several datasets in increasing
depth. Figure 3 plots the total number of HTTP requests
that the system received each day from mid-2004 through
early 2010, showing both the number of HTTP requests
from clients, as well as the number of requests issued to
upstream CoralCDN peers or origin sites. The traces show
common request rates for much of CoralCDN’s deploy-
ment between 5 and 20 million HTTP requests per day,
with more recent rates of 40-50 million daily requests.

Figure 5: CoralCDN traffic statistics for an arbitrary day (Aug 9).


We examined three time periods from these logs in more 
depth, each consisting of HTTP traffic over the same nine-
day period (August 9-18) in 2005, 2007, and 2009. Coral-
CDN received 15-25M requests during each day of these
periods. Figure 4 plots the total number of unique client IP 
addresses from which these requests originated (left) and 
the aggregate amount of bandwidth uploaded (right). The 
traces showed 1-2 million clients per day, resulting in a 
few terabytes of content transferred. We will primarily use 
the 2009 trace, consisting of 209M requests, in later anal- 
ysis. Figure 5 provides more information about the traffic 
patterns, focusing on the first day of each trace. 


Figure 6 plots the distribution of requests per unique 
URL. We see that the number of requests per URL follows 
a Zipf-like distribution, as common among web caching 
and proxy networks [5]. Certain URLs are very popular— 
the so-called “head” of the distribution—such as the most 
popular one in the Aug-9-2009 trace, which received al- 
most 1.6M requests itself. A large number of URLs—the 
distribution’s “heavy tail”—receive only a single request. 


The datasets also show stability in the most popular 
URLs and domains over time. In all three datasets, the 
most popular URL retained that ranking across all nine 
days. In fact, this URL in the 2007 and 2009 traces be- 
longed to the same domain: a site that uses CoralCDN to 
distribute rule-set updates for the popular Firefox AdBlock 
browser extension. Exploring this further, Figure 7 uses 
the 2009 trace to plot the request rate per day for the most 
popular domains (taking the union of each day’s most pop- 
ular five domains resulted in nine unique domains). We see 
that six of the nine domains had stable traffic patterns— 
they were long-term CoralCDN “customers”—while three 
varied between two and six orders of magnitude per day. 
The traffic patterns that we see in these two figures have 
design implications, which we discuss next. 


² The peak of 120M requests on August 21, 2008 corresponds to a
short-lived experiment of an academic research project using CoralCDN
as a key-value store [15].


USENIX Association 







Figure 6: Total requests per unique URL. (Log-log plot of requests per URL against unique URLs ranked by popularity, with one curve each for Aug-9-2005, Aug-9-2007, and Aug-9-2009.)

Figure 7: Requests per top-5 domain over time (Aug 9-18, 2009). (Requests per domain, on a log scale, against time in days.)


3.2 Implications of usage scenarios 


For CoralCDN to help under-provisioned websites survive 
unexpected traffic spikes, it does not require any prior reg-
istration or authorization. Yet while such openness is nec-
essary to enable even unmanaged websites to survive flash
crowds, it comes at a cost: CoralCDN is used in a variety 
of ways that differ from this more narrow goal. This sec- 
tion considers how well CoralCDN’s design is suited for 
its four main usage scenarios: 


1. Resurrecting old content: Anecdotally, some clients 
attempt to use CoralCDN for long-term durability. 
One can download browser plugins that link to both 
CoralCDN and archive.org as potential sources 
of content when origin servers are unavailable. 


2. Accessing unpopular content: CoralCDN’s request 
distribution shows a heavy tail of unpopular URLs. 
Servers may Coralize URLs that few visit. And some 
clients use CoralCDN as a more traditional proxy, 
for (presumed) anonymity, censorship or filtering cir- 
cumvention [32], or automated crawling. 


3. Serving long-term popular content: Most requests 
are for a small set of popular objects. These objects, 
already widely cached across the network, belong to 
the stable set of customer domains that effectively use 
CoralCDN as a free, long-term CDN provider. 


4. Surviving flash crowds to content: Finally, Coral- 
CDN is used for its stated goal of enabling underpro- 
visioned websites to withstand transient load spikes. 
Popular portals regularly link to Coralized URLs, and 
users post links in comments. Some sites even adopt 
dynamic and programmatic mechanisms to redirect 
requests to CoralCDN, based on observed load and 
request referrers. We discuss this further in §4.1. 


Unfortunately, CoralCDN’s design is not well-suited for 
the first three use cases. 


Top URLs | Total Size (MB) | % of Total Reqs
0.01%    | 14              | > 49
?        | 157             | ?
?        | 3744            | ?
?        | 28734           | ?

Figure 8: CoralCDN’s working set size for its most popular URLs
on Aug 9, 2009: A small percentage of URLs account for a large
fraction of requests, yet they require relatively little storage to cache.


Insufficient for resurrecting old content. CoralCDN is 
not designed for archival storage. Proxies do not proac- 
tively replicate content for durability, and unpopular con- 
tent is evicted from proxy caches over time. Further, if 
content has an expiry time (default is 12 hours), a proxy 
will serve expired content for at most 24 hours after the 
origin fails. Still, some clients attempt to use Coral- 
CDN for this purpose. This underscores a design trade- 
off: In stressing support for flash crowds rather than long- 
term durability, CoralCDN devotes its resources to provide 
availability for content being actively requested. On the 
other hand, by serving expired content for a limited dura- 
tion, CoralCDN can mask the temporary unavailability of 
an origin, at least for content already cached in its network. 


Unnecessary for unpopular content. While proxies 
can discover even rare cached content, CoralCDN does not 
provide any benefit by serving such unpopular content: It 
does not reduce servers’ load meaningfully, and it often 
results in higher client latency. As such, clients that use 
CoralCDN to avoid local filtering, circumvent geographic 
restrictions, or provide (minimal) anonymity may be better 
served by standard open proxies (that vanilla browsers can 
be configured to use) or through specialized tools such as 
Tor [12]. Yet, this type of usage persists—the long tail of 
Figure 6—and CoralCDN might then be better served with 
a different design for such traffic, i.e., one that doesn’t re-
quire a multi-hop, wide-area DHT lookup to complete be-
fore fetching content from the origin. For example, for its
modest deployment on PlanetLab, each Coral node could 
maintain connectivity to all others and simply use consis- 
tent hashing for a global, one-hop DHT [17, 37]. Alter- 
natively, Coral could only maintain connections with re- 
gional peers and eschew global lookups, a design which 
we evaluate further in §7. 
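
To make the one-hop alternative concrete, the following sketch shows consistent hashing over a fully connected set of nodes: every node knows all others, so the owner of a URL is found with one local computation instead of a multi-hop lookup. This is a generic textbook construction offered as an illustration, not a design taken from CoralCDN; the `Ring` class, the use of SHA-1, and the number of virtual points per node are all assumptions.

```python
import bisect
import hashlib
from typing import List, Sequence, Tuple


def _point(label: str) -> int:
    """Map a string onto a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring with virtual points (illustrative only)."""

    def __init__(self, nodes: Sequence[str], vpoints: int = 64) -> None:
        pairs: List[Tuple[int, str]] = sorted(
            (_point(f"{node}#{i}"), node)
            for node in nodes for i in range(vpoints))
        self._positions = [pos for pos, _ in pairs]
        self._owners = [node for _, node in pairs]

    def owner(self, url: str) -> str:
        """Return the node whose point follows the URL's hash on the ring."""
        i = bisect.bisect_right(self._positions, _point(url))
        return self._owners[i % len(self._owners)]
```

When a node leaves, only the URLs it owned move to other nodes; every other URL keeps its owner, which limits the churn caused by membership changes.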


Overkill for stably popular content, so far. For most 
of CoralCDN’s traffic, cooperation is not needed: Figure 6 
shows that a small number of URLs accounts for a large 
fraction of requests. We now measure their working set 
size in Figure 8, in order to determine how much storage is 
required to handle this traffic. We find that the most popu- 
lar 0.01% of URLs account for more than 49% of the total 
requests to CoralCDN, yet require only 14 MB of storage. 
Each proxy has a 3.0 GB disk cache, managed using an 
LRU eviction policy. This is sufficient for serving nearly 
85% of all requests from local cache.

70.4%  hit in local cache
12.6%  returned 4xx or 5xx error code
 9.9%  fetched from origin site
 7.1%  fetched from other CoralCDN proxy
         1.7% from level-0 cluster (global)
         1.9% from level-1 cluster (regional)
         3.6% from level-2 cluster (local)

Figure 9: CoralCDN access ratios for content during Aug 9, 2009.
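
The working-set measurement behind Figure 8 amounts to ranking URLs by request count and summing sizes and requests over the top fraction. A minimal sketch, assuming the trace has already been reduced to one `(requests, size_bytes)` pair per URL; the function name and input layout are hypothetical.

```python
import math
from typing import List, Tuple


def working_set(stats: List[Tuple[int, int]],
                fraction: float) -> Tuple[int, float]:
    """Return (bytes needed, share of all requests) for the top URLs.

    `stats` holds one (requests, size_bytes) pair per unique URL and
    `fraction` is the share of URLs to keep, e.g. 0.0001 for 0.01%.
    """
    ranked = sorted(stats, key=lambda s: s[0], reverse=True)
    top_n = max(1, math.ceil(len(ranked) * fraction))
    top = ranked[:top_n]
    total_requests = sum(requests for requests, _ in ranked)
    top_requests = sum(requests for requests, _ in top)
    top_bytes = sum(size for _, size in top)
    return top_bytes, top_requests / total_requests
```

For a Zipf-like workload the returned share grows quickly with the fraction while the byte total stays small, which is the pattern the text reports for the most popular 0.01% of URLs.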


These workload distributions support one aspect of 
CoralCDN’s design: Content should be locally cached 
by the “forward” CoralCDN proxy directly serving end- 
clients, given that small to moderate size caches in these 
proxies can serve a very large fraction of requests. This 
differs from the traditional DHT approach of just storing 
data on a small number of globally-selected proxies, so- 
called “server surrogates” [8, 37]. 

If CoralCDN’s working set can be fully cached by each 
node, we should understand how much cooperation is ac- 
tually needed. Figure 9 summarizes the extent to which 
proxies cooperate when handling requests. 70% of re- 
quests to proxies are satisfied locally, while only 7% result 
in cooperative transfers. (The high rate of error messages 
is due to admission control as a means of bandwidth man- 
agement, which we discuss in §5.2.) In short, at least for its 
current workload and environment, only a small fraction of 
CoralCDN’s traffic uses its cooperation mechanisms. 

A related result about the limits of cooperative caching 
had been observed earlier [38], but from the perspective of 
limited improvements in client-side hit rates. This is a sig- 
nificantly different goal from reducing server-side request 
rates, however: Non-cooperating groups of nodes would 
each individually request content from the origin. 

This design trade-off comes down to the question of 
how much traffic is too much for origin servers. For 
moderately-provisioned origins, such as the customers of 
commercial CDNs, a caching system might only rely on 
local or regional cooperation. In fact, Akamai’s network 
is designed precisely so: Nodes within each of its ap- 
proximately 1000 clusters cooperate, but each cluster typi- 
cally fetches content independently from origin sites [22]. 
To replicate such scenarios, Coral’s clustering algorithms 
could be used to self-organize a network into local or re- 
gional clusters. It could thus avoid the manual configura- 
tion of Harvest [7] or colocated deployments of Akamai. 

On the other hand, while cooperation is not needed for 
most traffic, CoralCDN’s ability to react quickly to flash 
crowds (to offload traffic from a failing or oversubscribed
origin) is precisely the scenario for which it was designed
(and commercial CDNs are not). We consider these next. 


Useful for mitigating flash crowds. CoralCDN’s traces 
regularly show spikes in requests to different URLs. We 
find, however, that these flash crowds grow in popularity 
on the order of minutes, not seconds. There is a sufficiently
long leading edge before traffic rises by several orders of
magnitude, which has interesting implications.

Figure 10: Flash crowd to a Coralized URL linked to by Slashdot.

Figure 11: Mini-flash crowds during August 2009 trace. Each datapoint
represents a one-minute duration; embedded subfigures show
request rates for the tens of minutes around the onset of flash crowds.

Figures 10 and 11 show the request patterns of several 
flash crowds that CoralCDN experienced. The former was 
to a site linked to in a Slashdot article in May 2005. After 
rising, the Slashdot flash crowd lasted less than three hours 
in duration and came to an abrupt conclusion (perhaps as 
the story dropped off the website’s main page). The latter, 
covering our August 2009 trace, shows spikes to the im-
age cache of a less popular portal (moonbuggy.org), as
well as to a well-publicized mirror for the collaboratively- 
filtered reddit.com, with another attenuated spike 24 
hours later. The embedded graphs in Figure 11 depict the 
request rates around the onset of the traffic spike for a nar- 
rower range of time. All three flash crowds show that the 
initial rise took minutes. 

For a more quantitative analysis of the frequency of flash 
crowds, we examined the prevalence of domains that ex- 
perience a large increase in their request rates from one 
time period to the next. In particular, Figure 12 consid- 
ers all five-second periods across the August 2009 ten- 
day trace. The left graph plots a complementary cumu- 
lative distribution function (CCDF) of the percentage of 
domains requested in each period that experience a 10- or 
100-fold rate increase. The right graph plots the percent- 
age of requests accounted for by these domains that ex- 
perience orders-of-magnitude (OOM) increases. Sudden
increases do exist, but they are rare. In 76% of 5s epochs,
no domains experienced any 10-fold increase, while in 1%
of epochs, 1.7% of domains (accounting for 12.9% of re-
quests) increased by one order-of-magnitude. Larger dy-
namism was even more rare: only in 0.006% of epochs did
there exist a domain that experienced a 100-fold increase
in request rate. No three OOM increase occurred.

Figure 12: CCDF of extent of flash-crowd dynamics in August
2009 trace. Left graph shows percentage of domains experiencing or-
ders of magnitude (OOM) changes in request rates across five-second
epochs. Right shows % requests for which these domains account.

Figure 13: CDFs of percentage of requests accounted for by do-
mains experiencing order(s)-of-magnitude rate increases. Rate in-
creases computed across epochs of 30 seconds (top left), 10 minutes
(top right), six hours (bottom left), and one day (bottom right). Plots
start on the y-axis with zero domains having such an increase, e.g.,
28% of 30s epochs have no domains with a > 1 OOM rate increase.

To further understand the precipitousness of “flash” 
crowds, Figure 13 extends this analysis across longer du- 
rations. Among 30s epochs, 50% of epochs have at most 
0.4% of domains experience a 10-fold increase in their 
rates (not shown), which account for a total of 1.0% of 
requests (top left). Only 0.29% of 30s epochs have any 
domains with more than a 100-fold rate increase. At 10- 
minute epochs, 28% of epochs have at least one domain 
that experiences a two OOM rate increase, while 0.21% 
have a domain with a three OOM increase. Still, these 
flash crowds account for a small fraction of total requests: 
Domains experiencing 100-fold increases accounted for at 
least 1% of all requests in only 3.8% of 10m epochs, and 
10% of requests in 0.05% of epochs. 


³ To avoid overcounting unpopular domains, we do not count changes
when the absolute number of requests to a domain in a given time period
is less than some minimum amount, i.e., 10 requests for 5s, 30s, and 10m
periods, and 100 requests for 6h and 1d periods.
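
The epoch analysis above can be sketched as follows. The code buckets a request log into fixed-length epochs and, for each epoch after the first, reports the percentage of domains whose request count grew by at least a given factor and the percentage of requests those domains account for. Two details are assumptions made here, since the text does not specify them: a domain with no requests in the previous epoch is compared against a count of one, and the minimum-request threshold is applied to the current epoch. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def epoch_counts(log: Iterable[Tuple[float, str]],
                 epoch_seconds: float) -> Dict[int, Counter]:
    """Count requests per domain in each epoch; log holds (time, domain)."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, domain in log:
        counts[int(timestamp // epoch_seconds)][domain] += 1
    return counts


def surging_domains(prev: Counter, cur: Counter,
                    factor: int = 10, min_requests: int = 10) -> List[str]:
    """Domains whose count rose by at least `factor` between epochs."""
    return [d for d, n in cur.items()
            if n >= min_requests and n >= factor * max(prev.get(d, 0), 1)]


def surge_stats(log: Iterable[Tuple[float, str]], epoch_seconds: float,
                factor: int = 10,
                min_requests: int = 10) -> Dict[int, Tuple[float, float]]:
    """Per epoch: (% of domains surging, % of requests they account for)."""
    counts = epoch_counts(log, epoch_seconds)
    stats: Dict[int, Tuple[float, float]] = {}
    for epoch in sorted(counts)[1:]:      # the first epoch has no baseline
        cur = counts[epoch]
        prev = counts.get(epoch - 1, Counter())
        hot = surging_domains(prev, cur, factor, min_requests)
        stats[epoch] = (100 * len(hot) / len(cur),
                        100 * sum(cur[d] for d in hot) / sum(cur.values()))
    return stats
```

Plotting the distribution of either value across epochs gives curves of the kind shown in Figures 12 and 13.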


In short, this data shows that (1) only a small fraction 
of CoralCDN’s domains experience large rate increases 
within short time periods, (2) those domains’ traffic ac- 
counts for a small fraction of the total requests, and (3) any 
rate increases very rarely occur on the order of seconds. 

This moderate adoption rate avoids the need to introduce 
even more aggressive content discovery algorithms. Sim- 
ulated workloads in early experiments (Figure 4 of [14]) 
showed that under high concurrency, CoralCDN might is- 
sue several redundant fetches to an origin server due to 
a race-like condition in its lookup protocol. If multiple 
nodes concurrently get the same key which does not yet ex- 
ist in the index, all concurrent lookups can fail and multiple 
nodes can contact the origin. This race condition is shared 
by most applications which use a distributed hash table 
(both peer-to-peer and datacenter services). But because 
these traces show that the arrival of user requests happens 
over a much longer time-scale than a DHT lookup, this 
race condition does not pose a significant problem. 

Note that it is possible to mitigate this condition. While 
designing a network file system for PlanetLab that sup- 
ported cooperative caching [2]—meant to quickly dis- 
tribute a file in preparation for a new experiment—we 
sought to minimize redundant fetches to the file server. We 
extended Coral’s insert operation to provide return status 
information, like test-and-set in shared-memory systems. 
A single put+get both returns the first values it encoun- 
tered in the DHT, as well as inserts its own values at an 
appropriate location (for a new key, this would be at its 
closest node). This optimization comes at a subtle cost, 
however, as it now optimistically inserts a node’s identity 
even before that proxy begins downloading the file! If the 
origin fetch fails—a greater possibility in CoralCDN’s en- 
vironment than with a managed file server—then the use of 
these index entries degrades performance. Thus, after us- 
ing this put+get protocol in CoralCDN for several months 
during 2005, we discontinued its use. 
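
The difference between separate get and put calls and the combined put+get can be shown with a single-node stand-in for the index. The `Index` class below is hypothetical and uses a lock only to model the atomicity of the combined operation at one node; it does not reproduce Coral’s distributed routing.

```python
import threading
from typing import Dict, List


class Index:
    """Single-node stand-in for the index (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, List[str]] = {}

    def get(self, key: str) -> List[str]:
        with self._lock:
            return list(self._values.get(key, []))

    def put(self, key: str, value: str) -> None:
        with self._lock:
            self._values.setdefault(key, []).append(value)

    def put_and_get(self, key: str, value: str) -> List[str]:
        """Insert `value` and return what was stored before, in one step."""
        with self._lock:
            existing = list(self._values.get(key, []))
            self._values.setdefault(key, []).append(value)
            return existing
```

With separate calls, two proxies can both see an empty result from get before either one calls put, and both then contact the origin. With `put_and_get`, only the first caller sees an empty result. The cost noted in the text is also visible here: the caller’s address is in the index before it holds any data.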


CoralCDN’s openness permits users to quickly leverage 
its resources under load, and its more complex coordina- 
tion helps mitigate these flash crowds and mask temporary 
server unavailability. Yet this very openness led to varied 
usage, the majority of which does not require CoralCDN’s 
more complex design. As we will see, this openness also 
introduces other problems. 


4 Lessons for the Web 


CoralCDN’s naming technique provides an open API for 
CDN services that can transparently work for almost any 
website. Over the course of its deployment, clients and 
servers have used this API to adopt CoralCDN as an elas- 
tic resource for content distribution. Through completely 
automated means, work can be dynamically expanded out 
to use CoralCDN when websites require additional band-
width resources, and it can be contracted back when flash
crowds abate. In doing so, its use presaged the notion of 
“surge computing” with public cloud platforms. But these 
naming techniques and CoralCDN’s open design introduce 
a number of web security problems, many of which are en- 
gendered by a lack of explicitness for specifying protection 
domains. We discuss these issues here. 


4.1 An API for elastic CDN services


We believe that the central reason for CoralCDN’s adop- 
tion has been its simple user interface and open design. 


Interface design. While superficially obvious, Coral- 
CDN’s interface design achieves several important goals: 


• Transparency: Work with unmodified, unconfigured,
and unaware web clients and webservers.


• Deep caching: Retrieve embedded images or links
automatically through CoralCDN when appropriate.


• Server control: Not interfere with sites’ ability to per-
form usage logging or otherwise control how their
content is served (e.g., via CoralCDN or directly).


• Ad-friendly: Not interfere with third-party advertis-
ing, analytics, or other tools incorporated into a site.


• Forward compatible: Be amenable to future end-to-
end security mechanisms for content integrity or other
end-host deployed mechanisms.


Consider an alternative and even simpler interface de-
sign [11, 25, 29], in which one embeds origin URLs into
the HTTP path, e.g., http://nyud.net/example.com/.
Not only is HTTP parsing simpler, but nameservers
would not need to synthesize DNS records on the fly (un- 
like our DNS servers for *.nyud.net). Unfortunately, 
while this interface can be used to distribute individual ob- 
jects, it fails on entire webpages. Any relative links would 
lack the example.com prefix that a proxy needs to iden-
tify its origin. One alternative might be to try to rewrite
pages to add such links, although active content such as
JavaScript makes this notoriously difficult. Further, such
active rewriting impedes a site’s control over its content, 
and it can interfere with analytics and advertisements. 

CoralCDN’s approach, however, interprets relative links 
with respect to a page’s Coralized hostname, and thus 
transparently requests these objects through it as well. 
But all absolute URLs continue to point to their origin 
sites, and third-party advertisements and analytics remain 
largely unaffected. Further, as CoralCDN does not mod- 
ify content, content also may be amenable to verification 
through end-to-end content signatures [30, 35]. 
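
As an illustration of the naming scheme described above, the following minimal Python sketch appends the nyud.net suffix to a URL's hostname and then resolves links against the resulting page URL. The function name coralize is hypothetical, and the sketch rejects URLs with explicit ports because the text does not say how they are encoded.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

CORAL_SUFFIX = ".nyud.net"  # suffix used in the examples in the text


def coralize(url: str) -> str:
    """Append the CoralCDN suffix to the hostname, leaving the path untouched."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host or parts.port is not None:
        raise ValueError("sketch handles only URLs with a hostname and no explicit port")
    if host.endswith(CORAL_SUFFIX):
        return url  # already Coralized
    return urlunsplit(
        (parts.scheme, host + CORAL_SUFFIX, parts.path, parts.query, parts.fragment)
    )


page = coralize("http://example.com/news/story.html")
# A relative link resolves against the Coralized hostname ...
relative = urljoin(page, "img/logo.png")
# ... while an absolute link keeps pointing at its own origin.
absolute = urljoin(page, "http://ads.example.org/banner.js")
```

Because the origin name stays in the hostname rather than the path, an unmodified browser performs the relative-link resolution shown here with no page rewriting.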

In short, it was important for adoption that site owners 
retain sufficient control over how their content is displayed 
and accessed. In fact, our predicted usage scenario of sites 
publishing Coralized URLs proved to be less popular than 
that of dynamic redirection (which we did not foresee).

An API for dynamic adoption. CoralCDN was envi-
sioned with manual URL manipulation in mind, whether
by publishers editing HTML, users typing Coralized 
URLs, or third-parties posting links. After deployment, 
however, users soon began treating CoralCDN’s interface 
as an API for accessing CDN services. 

On the client side, these techniques included simple 
browser extensions that offer “right-click” options to Cor- 
alize links or that provide a link when a page appears un- 
available. They ranged to more complex integration into 
frameworks like Firefox’s Greasemonkey [21]. Grease- 
monkey allows third-party developers to write site-specific 
javascript code that, once installed by users, manipulates a 
site’s HTML content (usually through the DOM interface) 
whenever the user accesses it. Greasemonkey scripts for 
CoralCDN include those that automatically rewrite links 
on popular portals, or modify articles to include tooltips or 
additional links to Coralized URLs. CoralCDN also has 
been integrated directly into a number of client-side soft- 
ware packages for podcasting. 

The more interesting cases of CoralCDN integration are 
on the server-side. One common strategy is for the origin 
to receive the initial request, but respond with a 302 redi- 
rect to a Coralized URL. This can work well even for flash 
crowds, as the overhead of generating redirects is modest 
compared to that of actually serving the content. 

Generating such redirects can be done by installing a 
server plugin and writing a few lines of configuration code. 
For example, the complete dynamic redirection rule using 
Apache’s mod_rewrite plugin is as follows. 

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} !^CoralWebPrx
RewriteCond %{QUERY_STRING} !(^|&)coral-no-serve$
RewriteRule ^(.*)$ http://%{HTTP_HOST}.nyud.net%{REQUEST_URI} [R,L]


Still, redirection rules must be crafted carefully. In this 
example, the second line checks whether the client is a 
CoralCDN proxy and thus should be served directly. Oth- 
erwise, a redirection loop potentially could be formed (al- 
though proxies prevent this from happening by checking 
for potential loops and returning errors if one is found). 
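
The same decision can be expressed outside Apache. The Python sketch below mirrors the rewrite rule's two conditions. The function name coral_redirect and its parameters are hypothetical, and it assumes the server passes in the raw header and query-string values.

```python
from typing import Optional

CORAL_SUFFIX = ".nyud.net"


def coral_redirect(host: str, request_uri: str, user_agent: str,
                   query_string: str) -> Optional[str]:
    """Return the Location for a 302 redirect, or None to serve the request locally."""
    # The requester is itself a CoralCDN proxy: serve directly to avoid a loop.
    if user_agent.startswith("CoralWebPrx"):
        return None
    # A proxy sent this client home with coral-no-serve: do not bounce it back.
    if query_string == "coral-no-serve" or query_string.endswith("&coral-no-serve"):
        return None
    return f"http://{host}{CORAL_SUFFIX}{request_uri}"
```

Both checks guard against the same failure, a client or proxy cycling between the origin and CoralCDN.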

Amusingly, some early users during CoralCDN’s de- 
ployment caused recursion in a different way—and a form 
of amplification attack—by submitting URLs with a long 
string of nyud.net’s appended to a domain. Before 
proxies checked for such conditions, this single request 
caused a proxy to issue a number of requests, stripping 
the last instance of nyud.net off in each iteration.
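
One way to check for this condition is to strip every trailing suffix in a single pass and refuse names that carry it more than once. The Python sketch below is an assumed policy rather than the check CoralCDN proxies actually perform, and origin_host is a hypothetical name.

```python
CORAL_SUFFIX = ".nyud.net"


def origin_host(coralized_host: str) -> str:
    """Recover the origin hostname; reject names with a repeated or bare suffix."""
    host = coralized_host.lower().rstrip(".")
    count = 0
    while host.endswith(CORAL_SUFFIX):
        host = host[: -len(CORAL_SUFFIX)]
        count += 1
    if count == 0:
        raise ValueError("not a Coralized hostname")
    if count > 1 or not host:
        # Assumed policy: a legitimate Coralized name carries the suffix exactly once.
        raise ValueError("suspicious hostname: repeated or bare CoralCDN suffix")
    return host
```

Counting the suffixes locally costs no network requests, so a long chain of appended suffixes cannot be turned into a chain of proxy fetches.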

While the above rewriting rule applies for all requests, 
other sites incorporate redirection in more inventive ways, 
such as only redirecting clients arriving from particular 
high-traffic referrers: 

RewriteCond %{HTTP_REFERER} slashdot\.org [NC,OR]
RewriteCond %{HTTP_REFERER} digg\.com [NC,OR]
RewriteCond %{HTTP_REFERER} blogspot\.com [NC]




And most interestingly, some sites have even combined 
such tools with server plugins that monitor server load and 
bandwidth use, so that their servers only start rewriting re- 
quests under high load conditions. 

Websites therefore used CoralCDN’s naming technique 
to leverage its CDN resources in an elastic fashion. Based 
on feedback from users, we expanded this “API” to give 
sites some simple control over how CoralCDN should han- 
dle their requests. For example, webservers can include 
X-Coral-Control response headers, which are saved 
as cache meta-data, to specify whether CoralCDN proxies 
should “redirect home” domains that exceed their band- 
width limits (per §5.2) or just return an error as is standard. 


4.2 Security and resource protection 


A number of security mechanisms curtailed the misuse of 
CoralCDN. We highlight the design principle for each. 


4.2.1 Limiting functionality 


CoralCDN proxies have only ever supported GET and 
HEAD requests. Many of the attacks for which “open” 
proxies are infamous [24] are simply not feasible. For ex- 
ample, clients cannot use CoralCDN to POST passwords 
for brute-force cracking. Proxies do not support CON- 
NECT requests, and thus they cannot be used to send spam 
as SMTP relays or to forge “From” addresses in web mail. 
Proxies do not support HTTPS and they delete all HTTP 
cookies sent in headers. These proxies thus provide mini- 
mal application functionality needed to achieve their goals, 
which is cooperatively serving cacheable content. 

CoralCDN’s design had several unexpected conse- 
quences. Perhaps most interestingly, given CoralCDN’s 
multi-layer caching architecture, attempting to crawl or 
brute-force attack a website via CoralCDN is quite slow. 
New or randomly-selected URLs first require a DHT 
lookup to fail, which serves to delay requests against an 
origin website, in much the same way that ssh “tarpits” de- 
lay responses to failed login attempts. In addition, because 
CoralCDN only handles explicit Coralized URLs, it cannot 
be used by simply configuring a vanilla browser’s proxy 
settings. Further, CoralCDN cannot be used to anony- 
mously launch attacks, as it eschews anonymity. Proxies 
use unique User-Agent strings (“CoralWebPrx”) and
include their identity in Via headers, and they report an 
instigating client’s IP address to the origin server (in an 
X-Forwarded-For request header). We can only sur- 
mise whether the combination of these properties played 
some role, but CoralCDN has seen little abuse as a plat- 
form for proxying server attacks. 


4.2.2 Curtailing excessive resource use 


CoralCDN’s major limiting resource is aggregate band- 
width. The system employs fair-sharing mechanisms to 
balance bandwidth consumption between origin domains, 


which we discuss further in §5.2. In addition to monitoring 
server-side consumption, proxies keep a sliding window of 
client-side usage. Not only do we seek to prevent exces- 
sive bandwidth consumption by clients, but also an exces- 
sive number of (even small) requests. These are caused 
typically by server misconfigurations that result in HTTP 
redirection loops (per §4.1) or by “bot” misuse as part of 
a brute-force attack. While CoralCDN’s limited function- 
ality mitigates such attacks, one notable brute-force login 
attempt took advantage of poor security at a top-5 website, 
which used cleartext passwords over GET requests. 

Given both its storage and bandwidth limitations, Coral- 
CDN enforces a maximum file size of 50 MB. This 
has generally prevented clients from using CoralCDN for 
video distribution, a pragmatic goal when deploying prox- 
ies on university-hosted PlanetLab servers. We found 
that sites attempted to circumvent these limits by omit- 
ting Content-Length headers (on connections marked 
as persistent and without chunked encoding). To ensure 
compliance, proxies now monitor ongoing transfers and 
halt (and blacklist) any ones that exceed their limits. This 
skepticism is needed as proxies interact with potentially 
untrusted servers, and thus must enforce complete media- 
tion [33] to their resources (in this case, bandwidth). 
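
The monitoring described above can be sketched as a wrapper around the response body that counts bytes as they arrive, rather than trusting a declared length. This is a minimal Python sketch under assumed names (capped, TransferTooLarge). Blacklisting of offending origins is omitted.

```python
from typing import Iterable, Iterator, Optional

MAX_OBJECT_BYTES = 50 * 1024 * 1024  # the 50 MB limit from the text


class TransferTooLarge(Exception):
    """Raised when a transfer declares or delivers more than the limit."""


def capped(chunks: Iterable[bytes], declared_length: Optional[int],
           limit: int = MAX_OBJECT_BYTES) -> Iterator[bytes]:
    """Yield body chunks, aborting as soon as the limit is crossed."""
    if declared_length is not None and declared_length > limit:
        raise TransferTooLarge("Content-Length exceeds limit")
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > limit:
            # The origin omitted or understated Content-Length: halt mid-transfer.
            raise TransferTooLarge("body exceeded limit")
        yield chunk
```

Checking on every chunk is what makes the mediation complete: the limit holds whether or not the origin declares a length.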


4.2.3 Blacklisting domains and offloading security


We maintain a global blacklist for blocking access to spec- 
ified origin domain names. Each proxy regularly fetches 
and reloads the blacklist. This is a practical, but not fun- 
damental, necessity, employed to prevent CoralCDN’s de- 
ployment sites from restricting its use. Parties that request 
blacklisting typically cite one of the following reasons. 


Suspected phishing. Websites have been concerned that 
CoralCDN is—or will be confused with—a phishing site. 
After all, both appear to be “scraping” content and publish 
a simulacrum under an alternate domain. The difference, 
of course, is that CoralCDN is serving the site’s content
unmodified, yet the web lacks any protocol to authenticate 
the integrity of content (as in S-HTTP [30]) in order to ver- 
ify this. As SSL only authenticates identity, websites must 
typically include CDNs in their trusted computing base. 


Potential copyright violation. Typically following a 
DMCA take-down notice, third-parties report that copy- 
righted material may be found on a Coralized domain and 
want it blocked. This scenario is mitigated by CoralCDN’s 
explicit naming—which preserves the name of the actual 
origin in question—and by its caching design. Once con- 
tent is removed from an origin server, it is evicted auto- 
matically from CoralCDN in at most 24 hours. This is a 
natural implication of its goal of handling flash crowds, 
rather than providing long-term availability. 


Circumventing access-control restrictions. Some do-
mains mediate access to their website via IP-based authen-
tication, whereby requests from particular IP prefixes are
granted access. This practice is especially common for on- 
line academic journals, in order to provide easy access for 
university subscribers. But open proxies within whitelisted 
prefixes would enable external clients to circumvent these 
access-control restrictions. 

By offloading policing to their customers, sites unnec- 
essarily enlarge their security perimeter to include their 
customer’s networks. This scenario is common yet unnec- 
essary. Recall that CoralCDN proxies do not hide their 
identities, and they include the originating client’s IP ad- 
dress in standard request headers. Thus, origin sites can re- 
tain IP-based authentication while verifying that a request 
does not originate from outside allowed prefixes.⁴ Sites
are just not making use of this information, and thus fail to
properly mediate access to their protected resources.⁵
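
A minimal Python sketch of such a check follows. It assumes the origin sees the connecting peer's address along with the Via and X-Forwarded-For request headers. The function name client_allowed and the policy of requiring every forwarded address to fall inside the allowed prefixes are assumptions made for illustration.

```python
import ipaddress
from typing import Iterable, Optional


def client_allowed(peer_ip: str, via: Optional[str], forwarded_for: Optional[str],
                   allowed_prefixes: Iterable[str]) -> bool:
    """IP-based access check that looks past a self-identifying proxy."""
    networks = [ipaddress.ip_network(p) for p in allowed_prefixes]

    def inside(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip.strip())
        except ValueError:
            return False  # fail-safe default: an unparseable address is denied
        return any(addr in net for net in networks)

    if not inside(peer_ip):
        return False
    if via is None and forwarded_for is None:
        return True  # direct request from a whitelisted address
    if not forwarded_for:
        return False  # proxied, but the originating client is unknown
    # Every address in the forwarding chain must lie inside the allowed prefixes.
    return all(inside(ip) for ip in forwarded_for.split(","))
```

The check denies by default whenever the originating client cannot be established, which is the fail-safe behavior the text argues sites should adopt.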

We did encounter some interesting attacks on our 
domain-based blacklists, akin to fast-flux networks. An 
adversary created dynamic DNS records for a random do- 
main that pointed to the IP address of a target domain (an 
online academic journal). The random domain naturally 
was not blacklisted by CoralCDN, and the content was 
successfully downloaded from the origin target. Such a 
circumvention technique would not have worked if the ori- 
gin site checked either proxy headers (as above) or even 
just the Host field of the HTTP request. The Host cor- 
responded to the fast-flux attack domain, not that of the 
journal. Again, this security hole demonstrates a lack of 
explicit verification and fail-safe defaults [33]. 


4.3 Security and naming conflation 


We argued that CoralCDN’s naming provided a powerful 
API for accessing CDN services. Unfortunately, its tech- 
nique has serious implications as the Web’s Same Origin 
Policy (SOP) conflates naming with security. 

Browsers use domain names for three purposes. (1) Do- 
mains specify where to retrieve content after they are re- 
solved to IP addresses, precisely how CoralCDN enacts 
its layer of indirection. (2) Domains provide a human- 
readable name for what administrative entity a client is 
interacting with (e.g., the “common name” identified in
SSL server certificates). (3) Domains specify what security 
policies to enforce on web objects and their interactions. 

The Same Origin Policy specifies how scripts and in- 
structions from an origin domain can access and modify 


⁴ This does not address the corner case in which the original request
comes from an IP address within that prefix, while subsequent ones that
access the then-cached content do not. This can be handled typically by
marking content as not cacheable, or by having a proxy include headers
that explicitly specify its client population (i.e., as “open” or by IP prefix).

⁵ One might argue that sites use a pure IP-based filtering approach
given its ability to be implemented in layer-3 front-end load balancers.
But this is not a simple firewall problem, as sites also permit access for
individual users that login with the appropriate credentials. The sites with
which we communicated implemented such authorization logic either di-
rectly in webservers or in complex, layer-7 front-end appliances.

browser state. This policy applies to manipulating cookies,
browser windows, frames, and documents, as well as to 
accessing other URLs via an XmlHttpRequest. At its sim- 
plest level, all of these behaviors are only allowed between 
resources that belong to an identical origin domain. This 
provides security against sites accessing each others’ pri- 
vate information kept in cookies, for example. It also pre- 
vents websites that run advertisements (such as Google’s 
AdSense) from easily performing click fraud to pay them- 
selves advertising dollars by programmatically “clicking” 
on their site’s advertisements.⁶

One caveat to the strict definition of an identical ori-
gin [18] is that it provides an exception for domains
that share the same domain.tld suffix, in that
www.example.com can read and set cookies for example.com.
This has bad implications for CoralCDN’s naming
strategy. When example.com is accessed via CoralCDN,
it can manipulate all nyud.net cookies, not just
those restricted to example.com.nyud.net.⁷ Con-
cerned with the potential privacy violations from this sce-
nario, CoralCDN deletes all cookies from headers.

Unfortunately, many websites now manage cookies via 
javascript, so cookie information can still “leak” between 
Coralized domains on the browser. This happens of- 
ten without a site’s knowledge, as sites commonly use a 
URL’s domain.tld without verifying its name. Thus,
if the Coralized example.com writes nyud.net cook- 
ies, these will be sent to evil.com.nyud.net if the 
client visits that webpage. Honest CoralCDN proxies will 
delete these cookies in transit, but attackers can still cir- 
cumvent this problem. For example, when a client vis- 
its evil.com.nyud.net, Javascript from that page can 
access nyud.net cookies, then issue an XmlHttpRequest
back to evil.com.nyud.net with cookie information 
embedded in the URL. Similar attacks are possible against 
other uses of the SOP, especially as it relates to the abil- 
ity to access and manipulate the DOM. Note that these at- 
tack vectors exist even while CoralCDN operates on fully- 
trusted nodes, let alone more peer-to-peer environments! 
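
The leak follows directly from how browsers match a cookie's Domain attribute against a request host. The Python sketch below is a simplified form of the domain-matching rule in RFC 6265, ignoring IP-address hosts. It shows why a cookie scoped to the shared suffix reaches every Coralized site, while one scoped to the full Coralized name does not.

```python
def domain_match(request_host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain matching (IP-address hosts are not handled)."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


# A cookie that the Coralized example.com scopes to the shared suffix ...
shared = "nyud.net"
# ... versus one scoped to its own Coralized name.
scoped = "example.com.nyud.net"

leaks_to_evil = domain_match("evil.com.nyud.net", shared)
stays_private = not domain_match("evil.com.nyud.net", scoped)
```

Here leaks_to_evil and stays_private both evaluate to True, which is the difference between defaulting to the domain.tld suffix and stripping only the minimal number of prefixes.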

Rather than conclude that CoralCDN’s domain manipu- 
lation is fundamentally flawed, we argue that better adher- 
ence to security principles is needed. Websites are partially 
at fault because they default access to domain.tld suf- 
fixes too readily, as opposed to stripping the minimal num- 
ber of domain prefixes: a violation of the principle of least 
information. An alternative solution that embraces least 


⁶ This is prevented because advertisements like AdSense load in an
iframe that the parent document—the third-party website that stands to 
gain revenue—cannot access, as the frame belongs to a different domain. 

⁷ Commercial CDNs like Akamai are typically not susceptible to such
attacks, as they generally use a separate top-level domain for each cus-
tomer, as opposed to CoralCDN’s suffix-based approach. Unlike Coral-
CDN’s zero configuration, however, such designs require that origins
preestablish an operational relationship with their CDN provider and
point their domain to the CDN service (e.g., by aliasing their domain
to the CDN through CNAME records in DNS).

privilege (and has much better incremental deployability)
would be to allow sources of content to explicitly constrain 
default security policies. As one simple example, when 
serving content for some origin.tld, proxies could use
HTTP response headers to specify that the most permis-
sive domain should be origin.tld.domain.tld, not
their own domain.tld. Interestingly, HTML 5, Flash,
and various javascript hacks [6] are all exploring methods
to expand explicit cross-domain communication.⁸ Both
proposals avow that the SOP is insufficient and should be 
adapted to support more flexible control through explicit 
rules; ours just views its corner cases as too permissive, 
while the other views its implications as too restrictive. 


5 Lessons for CDNs 


Unlike most commercial counterparts, CoralCDN is de- 
signed to interact with overloaded or poorly-behaving ori- 
gin servers. Further, while commercial systems will grow 
their networks based on expected use (and hence revenue), 
the CoralCDN deployment is comprised of volunteer sites 
with fixed, limited bandwidth. This section describes how 
we adapted CoralCDN to satisfy these realities. 


5.1 Designing for faulty origins 


Given its design goals, CoralCDN needs to react to non- 
crash failures at origin servers as the rule, not the excep- 
tion. Thus, one design philosophy that has come to govern 
CoralCDN’s behavior is that proxies should accept content
conservatively and serve results liberally. 

Consider the following, fairly common, situation. A 
portal runs a story that links to a third-party website, driv- 
ing a sudden influx of readers to this previously unpopular 
site. A user then posts a Coralized link to the third-party 
site as a “comment” to the portal’s story, providing an al- 
ternate means to fetch the content. 

Several scenarios are possible. (1) The website’s origin 
server becomes unavailable before any proxy downloads 
its content. (2) CoralCDN already has a copy of the con- 
tent, but requests arrive to it after the content’s expiry time 
has passed. Unfortunately, subsequent HTTP requests to 
the origin webserver result in failures or errors. (3) Or, 
CoralCDN’s content is again expired, but subsequent re- 
quests to the origin yield only partial transfers. CoralCDN 
employs different mechanisms to handle these failures. 


Cache negative service results (#1). CoralCDN may 
be hit with a flood of requests for an inaccessible URL, 
e.g., DNS resolution fails, TCP connections time out, etc.
For these situations, proxies maintain a local negative re- 
sult cache about repeated failures. Otherwise, both prox- 
ies and their local DNS resolvers have experienced re- 


⁸ This is in reaction to the common practice of inserting third-party ob-
jects into a document’s namespace via <script> (and thus sacrificing
security protections), as the SOP does not permit a middle ground.


source exhaustion, given flash crowds to apparently dead 
sites. (While negative result caching has also long been 
part of some DNS implementations [19], it is not universal 
and does not extend to TCP or application-level failures.) 
While more a usability issue, CoralCDN still receives re- 
quests for some Coralized URLs several years after their 
origins became unavailable. 
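
A negative result cache of this kind can be sketched as a small map from URL to failure reason with an expiry time. In the Python sketch below, the class name, the 60-second lifetime, and keying by URL are all assumptions, since the text does not give these details.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class NegativeCache:
    """Remember recent fetch failures so that repeated requests fail fast."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds  # assumed lifetime; the text gives no value
        self.clock = clock
        self._failures: Dict[str, Tuple[float, str]] = {}

    def record_failure(self, url: str, reason: str) -> None:
        """Note that fetching url just failed (DNS, TCP, or application error)."""
        self._failures[url] = (self.clock() + self.ttl, reason)

    def recent_failure(self, url: str) -> Optional[str]:
        """Return the cached failure reason, or None if the origin should be tried."""
        entry = self._failures.get(url)
        if entry is None:
            return None
        expires, reason = entry
        if self.clock() >= expires:
            del self._failures[url]  # entry aged out: allow a fresh attempt
            return None
        return reason
```

A proxy would consult recent_failure before opening a connection, so a flood of requests for a dead site costs one lookup each instead of a DNS query and a TCP timeout.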


Serve stale content if origin faulty (#2). CoralCDN 
seeks to avoid replacing good content with bad. As its 
proxies mostly obey content expiry times specified in 
HTTP headers,⁹ if cached content expires, proxies perform
a conditional request (If-Modified-Since) to revali-
date or update expired content. Overloaded origin servers
might fail to respond or might return some temporary error 
condition (data in §7 shows this to occur in about 0.5% of 
origin requests). Rather than retransmit this error, Coral- 
CDN proxies return the stale content and continue to retain 
it for future use (for up to 24 hours after it expires). 
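
The revalidation policy can be summarized as a small decision function. The Python sketch below uses hypothetical names. It assumes a timeout is represented by a status of None and treats only timeouts and 5xx codes as temporary errors, which is a simplification given the ambiguity of real status codes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_STALE_SECONDS = 24 * 3600  # system-wide default from the text


@dataclass
class Cached:
    body: bytes
    expires_at: float  # seconds since the epoch


def choose_response(cached: Optional[Cached], origin_status: Optional[int],
                    origin_body: Optional[bytes], now: float,
                    max_stale: float = MAX_STALE_SECONDS
                    ) -> Tuple[Optional[bytes], str]:
    """Pick what to return after a revalidation attempt; a None body means an error."""
    if origin_status == 304 and cached is not None:
        return cached.body, "revalidated"
    if origin_status == 200:
        return origin_body, "origin"
    origin_failed = origin_status is None or origin_status >= 500
    if origin_failed and cached is not None and now - cached.expires_at <= max_stale:
        return cached.body, "stale"  # keep serving good content over a bad reply
    return None, "error"
```

The stale branch is the only one that departs from ordinary proxy behavior: a failed revalidation leaves the cached copy in service until the 24-hour window closes.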


Prevent truncations through whole-file overwrites (#3). 
Rather than not responding or returning an error, what if a 
revalidation yields a truncated transfer? This is not uncom- 
mon during a flash crowd, as a CoralCDN proxy will be 
competing for a webserver’s resources. Rather than have 
proxies lose stale yet complete versions of objects, proxies 
implement whole-file overwrites in the spirit of AFS [16]. 
Namely, if a valid web object is already cached, the new 
version is written to a temporary file. Only after the new 
version completes downloading and appears valid (based 
on Content-Length) will a proxy replace the old one. 
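
The whole-file overwrite can be sketched with a temporary file and an atomic rename. This Python sketch is illustrative only: the function name is hypothetical, and rejecting responses that lack a Content-Length is an assumption made here so that completeness can always be verified.

```python
import os
import tempfile
from typing import Iterable, Optional


def replace_if_complete(path: str, chunks: Iterable[bytes],
                        content_length: Optional[int]) -> bool:
    """Write the new version to a temp file; swap it in only if it looks complete."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    written = 0
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                written += len(chunk)
        if content_length is None or written != content_length:
            return False  # truncated or unverifiable: keep the old copy
        os.replace(tmp_path, path)  # atomic rename within one filesystem
        return True
    finally:
        # Clean up the temp file unless the rename already consumed it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the old file is touched only by the final rename, a transfer that is cut short at any point leaves the stale but complete copy in place.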


These approaches are not fail-proof, limited by both se- 
mantic ambiguity in status directives and inaccuracies with 
their use. In terms of ambiguity, does a 403 (Forbidden) 
response code signify that a publisher seeks to make the 
content unavailable (permanent), or is it caused by a web- 
site surpassing its daily bandwidth limits and having re- 
quests rejected (temporary)? Does a 404 (File Not Found) 
code indicate whether the condition is permanent (due to 
a DMCA take-down notice) or temporary (from a PHP or 
database error)? On the other hand, the application of sta- 
tus directives can be flawed. We often found websites to 
report human-readable errors in HTML body content, but 
with an HTTP status code of 200 (Success). This scenario 
leads CoralCDN to replace valid content with less useful 
information. We hypothesize that bad defaults in scripting 
languages such as PHP are partially to blame. Instead of 
being fail-safe, the response code defaults to success. 

Even if transient errors were properly identified, for how 
long should CoralCDN serve expired content? HTTP lacks 


⁹ Proxies in our deployment are configured with a minimum ex-
piry time of some duration (five minutes), and thus do not recognize
No-Cache directives as such. Because CoralCDN does not support 
cookies, SSL bridging, or POSTs, however, many of the privacy concerns 
associated with caching such content are alleviated.

the ability to specify explicit policy for handling expired
content. Akamai defaults to a fail-safe scenario by not re- 
turning stale content [22], while CoralCDN seeks to bal- 
ance this goal with availability under server failures. As 
opposed to only using the system-wide default of 24 hours, 
CoralCDN recently enabled its users to explicitly specify 
their policy through max-stale response headers.¹⁰

These examples all point to another lesson that governs 
CoralCDN’s proxy design: Maintain the status quo unless 
improvements are possible. 


Decoupling service dependencies. A similar theme of 
only improving the status quo governs CoralCDN’s man- 
agement system. CoralCDN servers query a centralized 
management point for a number of tasks: to update their 
overall run status, to start or stop individual service compo- 
nents (HTTP, DNS, DHT), to reinstall or update to a new 
software version, or to learn shared secrets that provide 
admission control to its DHT. Although designed for inter- 
mittent connectivity, one of CoralCDN’s significant out- 
ages came when the management server began misbehav- 
ing and returning unexpected information. In response, we 
adopted what one might call fail-same behavior that ac- 
cepts updates conservatively, an application of decoupling 
techniques from fault-tolerant systems. Management in- 
formation is stored durably on servers, maintaining their 
status-quo operation (even across local crashes) until well- 
formed new instructions are received. 
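
A minimal Python sketch of fail-same behavior follows. The JSON schema, the file layout, and the function names are assumptions for illustration, not CoralCDN's actual management protocol. The sketch also assumes a last good state file already exists.

```python
import json
import os
from typing import Optional

KNOWN_SERVICES = {"http", "dns", "dht"}  # service components named in the text


def well_formed(update: object) -> bool:
    """Assumed schema: {"run": bool, "services": {name: bool}} with known names only."""
    if not isinstance(update, dict) or not isinstance(update.get("run"), bool):
        return False
    services = update.get("services")
    return (isinstance(services, dict) and bool(services)
            and set(services) <= KNOWN_SERVICES
            and all(isinstance(v, bool) for v in services.values()))


def apply_update(state_path: str, raw: Optional[bytes]) -> dict:
    """Return the instructions to run with, persisting them only if well formed."""
    try:
        update = json.loads(raw) if raw is not None else None
    except ValueError:
        update = None  # unparseable reply from the management server
    if well_formed(update):
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(update, f)
            f.flush()
            os.fsync(f.fileno())  # durable before it replaces the old state
        os.replace(tmp, state_path)
        return update
    with open(state_path) as f:  # fail-same: keep the last good instructions
        return json.load(f)
```

An unreachable or misbehaving management server therefore changes nothing: the node keeps running with whatever it last stored.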


5.2 Managing oversubscribed bandwidth 


While commercial CDNs and computing platforms often 
respond to oversubscription by acquiring more capacity, 
CoralCDN’s deployment on PlanetLab does not have that 
luxury. Instead, the service must manage its bandwidth 
consumption within prescribed limits. This adoption of 
bandwidth limits was spurred on by administrative de- 
mands from its deployment sites. Following the Asian 
tsunami of December 2004, and with YouTube yet to be 
created, CoralCDN distributed large quantities of amateur 
videos of the natural disaster. With no bandwidth restric- 
tions on PlanetLab at the time, CoralCDN’s network traf- 
fic to the public Internet quickly spiked. PlanetLab sites 
threatened to pull their servers off the network if such 
use could not be curtailed. It was agreed that CoralCDN 
should restrict its usage to approximately 10 GB per day 
per server (i.e., per PlanetLab sliver). 

Several design options exist for limiting bandwidth con- 
sumption. A proxy could simply shut down after exceed- 
ing a configured daily capacity (as supported by Tor [12]). 
Or it could rate-limit its traffic to prevent transient conges- 
tion (as done by BitTorrent and Tor). But as CoralCDN 


¹⁰ HTTP/1.1 supports max-stale request headers, although we are not
aware of their use by any HTTP clients. Further, as proxies often evict
expired content from their caches, it is unclear whether such request di-
rectives can be typically satisfied.

Figure 14: Requests per domain and number of 403 rejections.


primarily provides a service for websites, as opposed to 
clients, we chose to allocate its limited bandwidth in a way 
that both preserves some notion of fairness across its cus- 
tomer domains and maintains its central goal of handling 
flash crowds. The technique we developed is more broadly 
applicable than just PlanetLab and federated testbeds: to 
P2P deployments where users run peers within resource 
containers, to multi-tenant datacenters sharing resources 
between their own services, or to commercial hosting en- 
vironments using billing models such as 95th-%ile usage. 

Providing per-domain fairness might be resource inten- 
sive or difficult in the general case, given that CoralCDN 
interacts with 10,000s of domains each day, but our highly- 
skewed workloads greatly simplify the necessary account- 
ing. Figure 14 shows the total number of requests per 
domain that CoralCDN received over one day (the solid 
top line). The distribution clearly has some very pop- 
ular domains—the most popular one (a Tamil clone of 
YouTube) received 2.6M requests—while the remaining 
distribution fell off in a Zipf-like manner. (Note that Fig- 
ure 6 was in terms of unique URLs, not unique domains.) 
Given that CoralCDN’s traffic is dominated by a limited 
number of domains, its mechanisms can serve mainly to 
reject requests for (i.e., perform admission control on) 
these bandwidth hogs. Still, CoralCDN should differenti- 
ate between peak limits and steady-state behavior to allow 
for flash crowds or changing traffic patterns. 

To achieve these aims, each CoralCDN proxy imple- 
ments an algorithm that attempts to simultaneously (1) 
provide a hard-upper limit on peak traffic per hour (con- 
figured to 1000 MB per hour per proxy), (2) bound the 
expected total traffic per epoch in steady state (400 MB 
per hour per proxy), and (3) bound the steady-state limit 
per domain. As setting this last limit statically—such as 
1/k-th of the total traffic if there are k popular domains— 
would lead to good fairness but poor utilization (given the 
non-uniform distribution across domains), we dynamically 
adjust this last traffic limit to balance this trade-off. 

During each hour-long epoch, a proxy records the total 
number of bytes transmitted for each domain. It also cal- 
culates domains’ average bandwidth as an exponentially- 
weighted moving average (attenuated over one week), as 
well as the total average consumption across all domains. 
This long attenuation period provides long-term fairness
(and most consumption is long-term, as shown in Fig-
ure 7) but also emphasizes support for short-term flash
crowds. Across epochs, bandwidth usage is only tracked, 
and durably stored, for the top-100 domains. If a domain 
is not currently one of the top-100 bandwidth consumers, 
its historical average bandwidth is set to zero (providing 
additional leeway to sites experiencing flash crowds). 


When a requested domain is over its hourly budget (case 
3 above), CoralCDN proxies respond with 403 (Forbidden) 
messages. If instead the proxy is over its peak or steady- 
state limit calculated over all domains (cases 1 or 2 above), 
then the proxy redirects the client back to the origin site, 
and the proxy temporarily makes itself unavailable for new 
client requests, which would be rejected anyway.¹¹
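
The per-proxy accounting can be sketched as follows. This Python sketch covers only the per-domain budget (case 3) and the hard peak cap (case 1). The steady-state bound and the dynamic adjustment of the per-domain limit are omitted because the text does not give their formulas. The EWMA weight, the blend between history and the current epoch, and all names are assumptions.

```python
from typing import Dict

MB = 1_000_000
PEAK_LIMIT = 1000 * MB   # hard cap per hour-long epoch per proxy (from the text)
TOP_K = 100              # domains whose history survives across epochs (from the text)
ALPHA = 1.0 / (7 * 24)   # assumed EWMA weight: about one week of hourly epochs
HISTORY_WEIGHT = 0.5     # assumed blend of historical average and current epoch


class BandwidthPolicy:
    def __init__(self, domain_limit: int) -> None:
        self.domain_limit = domain_limit     # fixed here; adjusted dynamically in CoralCDN
        self.epoch: Dict[str, int] = {}      # bytes served per domain this epoch
        self.average: Dict[str, float] = {}  # moving average for the top-K domains

    def admit(self, domain: str, size: int) -> str:
        """Return 'serve', '403' (domain over budget), or 'redirect' (proxy at peak)."""
        used = self.epoch.get(domain, 0) + size
        history = self.average.get(domain, 0.0)  # zero for domains outside the top K
        blended = (1 - HISTORY_WEIGHT) * used + HISTORY_WEIGHT * history
        if blended > self.domain_limit:
            return "403"
        if sum(self.epoch.values()) + size > PEAK_LIMIT:
            return "redirect"
        self.epoch[domain] = used
        return "serve"

    def end_epoch(self) -> None:
        """Fold this hour into the moving averages and keep only the top K domains."""
        for domain in set(self.epoch) | set(self.average):
            old = self.average.get(domain, 0.0)
            self.average[domain] = (1 - ALPHA) * old + ALPHA * self.epoch.get(domain, 0)
        top = sorted(self.average, key=self.average.get, reverse=True)[:TOP_K]
        self.average = {d: self.average[d] for d in top}
        self.epoch = {}
```

Dropping history for domains outside the top K means a site that suddenly becomes popular starts with a zero average, which is the leeway for flash crowds that the text describes.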

By applying these mechanisms, CoralCDN reduces its 
bandwidth consumption to manageable levels. While its 
demand sometimes exceeds 10 TB per day (aggregate
across all proxies), its actual HTTP traffic remains steady 
at about 2 TB per day after rejecting a significant number 
of requests. The scatter plot in Figure 14 shows the num- 
ber of requests resulting in 403 responses per domain, most 
due to these admission control mechanisms. We see how 
variances in domains’ object sizes yield different rejection 
rates. The second-most popular domain serves mostly im- 
ages smaller than 10 KB and experiences a rejection rate of 
3.3%. Yet the videos of the third-most popular domain— 
user-contributed screensavers of fractal flames—are typi- 
cally 5 MB in size, leading to an 89% rejection rate. 


Note that we could significantly curtail the use of Coral- 
CDN as a long-term CDN provider (see §3.2) through sim- 
ple changes to these configuration settings. A low steady- 
state limit per domain, coupled with a greater weight on 
a domain’s historic averages, devotes resources to flash- 
crowd relief at the exclusion of long-term consumption. 


Admittedly, CoralCDN’s approach penalizes an origin 
site with more regional access patterns. Bandwidth ac-
counting and admission control are performed indepen-
dently on each node, reflecting CoralCDN’s lack of cen-
tralization. By not sharing information between nodes
(provided that DNS resolution preserves locality), a site 
with regional interest can be throttled before it reaches its 
fair share of global capacity. While this does not pose 
an operational problem for CoralCDN, it is an interest- 
ing research problem to perform (approximate) accounting 
across the network that is both decentralized and scalable. 
Distributed Rate Limiting [28] considered a related prob- 
lem, but focused on instantaneous limits (e.g., Mbps) in- 
stead of long-term aggregate volumes and gossiped state 
that is linear in both the number of domains and nodes.

¹¹ If clients are redirected back to the origin, a proxy appends the query-
string coral-no-serve on the location URL returned to the client.
Origins that use redirection scripts with CoralCDN check for this string to
prevent loops, per §4.1. Although not the default, operators of some sites
preferred this redirection home even if their domain was to blame (a pol-
icy they can specify through an X-Coral-Control response header).

Figure 15: RPC RTTs to various levels of Coral’s DHT hierarchy.


5.3 Managing performance jitter


Running on an oversubscribed deployment platform, 
CoralCDN developed several techniques to better han- 
dle latency variations. With PlanetLab services facing 
high disk, memory, and CPU contention, and sometimes 
additional traffic shaping in the kernel, applications can 
face both performance jitter and prolonged delays. These 
performance variations are not unique to PlanetLab, and 
they have been well documented across a variety of set- 
tings. For example, Google’s MapReduce [10] took run- 
time adaption of cluster query processing [3] to the large- 
scale, where performance variations even among homo- 
geneous components required speculative re-execution of 
work. More recently, studies of a MapReduce clone on 
Amazon’s EC2 underscored how shared and virtualized 
platforms provide new performance challenges [39]. 

CoralCDN saw the implications of performance vari- 
ations most strikingly with its latency-sensitive self- 
organization. For example, Coral’s DHT hierarchy was 
based on nodes clustering by network RTTs. A node would 
join a cluster provided some minimum fraction (85%) of 
its members were below the specified threshold (30 ms for 
level 2, 80 ms for level 1). Figure 15 shows the RTTs for 
RPC between Coral nodes, broken down by levels (with 
vertical lines added at 30 ms, 80 ms, and 1 s). While the
clustering algorithms achieve their goals and local clusters
have lower RTTs, the heavy tail in all CDFs is rather striking.
Fully 1% of RPCs took longer than 1 second, even
within local clusters. Coral’s use of concurrent RPCs dur- 
ing DHT operations helped mask this effect. 

Another lesson from CoralCDN’s deployment was the 
need for stability in the face of performance variations. 
This translated to the following rule in Coral. A node 
would switch to a smaller (and hence less attractive) cluster 
only if fewer than 70% of a cluster’s members now satisfy 
its threshold, and form a singleton only if fewer than 50% 
of neighbors are satisfactory. In other words, the barrier to 
enter a cluster is high (85%), but once a member, it’s eas- 
ier to remain. Before leveraging this form of hysteresis, 
cluster oscillations were much more common, which led
to many stale DHT references. A related use of hysteresis
within self-organizing systems helped improve virtual
network coordinate systems for both PlanetLab [26] and
Azureus [20], as well as failure recovery in Bamboo [31].

Figure 16: Comparison of PlanetLab’s accounting of all upstream
traffic, PlanetLab’s count to non-PlanetLab destinations, and
CoralCDN’s accounting through HTTP logs.
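The join and hysteresis thresholds described in this section can be sketched as follows. This is only an illustration under stated assumptions, not CoralCDN's actual code: the function and enum names are hypothetical, RTT samples are assumed to be available as a plain list of milliseconds, and the "fewer than 50% of neighbors" test is simplified to use the same list of cluster members.

```python
from enum import Enum

# Thresholds taken from the text; everything else is a hypothetical sketch.
JOIN_FRACTION = 0.85       # fraction of members that must meet the RTT threshold to join
STAY_FRACTION = 0.70       # below this, switch to a smaller cluster
SINGLETON_FRACTION = 0.50  # below this, form a singleton


class Action(Enum):
    STAY = "stay"
    SWITCH_TO_SMALLER = "switch_to_smaller"
    FORM_SINGLETON = "form_singleton"


def fraction_within(rtts_ms, threshold_ms):
    """Fraction of measured RTTs strictly below the level's threshold."""
    if not rtts_ms:
        return 0.0
    return sum(1 for rtt in rtts_ms if rtt < threshold_ms) / len(rtts_ms)


def may_join(rtts_ms, threshold_ms):
    """High barrier to entry: at least 85% of members must be close enough."""
    return fraction_within(rtts_ms, threshold_ms) >= JOIN_FRACTION


def membership_action(rtts_ms, threshold_ms):
    """Lower barrier to remain: an existing member leaves only below 70% (or 50%)."""
    frac = fraction_within(rtts_ms, threshold_ms)
    if frac < SINGLETON_FRACTION:
        return Action.FORM_SINGLETON
    if frac < STAY_FRACTION:
        return Action.SWITCH_TO_SMALLER
    return Action.STAY
```

With a 30 ms level-2 threshold, a cluster in which 80% of members are within range would not admit a new node, yet an existing member would stay; that gap between the two thresholds is what damps oscillation.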


6 Lessons for Platforms 


With the growth of virtualized hosting and cloud deploy- 
ments, Internet services are increasingly running on third- 
party infrastructure. Motivated by CoralCDN’s deploy- 
ment on PlanetLab, we discuss some benefits from im- 
proving an application’s visibility into and control over its 
lower layers. We first revisit CoralCDN’s bandwidth man- 
agement from the perspective of fine-grained service dif- 
ferentiation, then describe tackling its fault-tolerance chal- 
lenge with adequate network support. 


6.1 Exposing information and expressing 
preferences across layers 


We described CoralCDN’s bandwidth management as self- 
regulating, which works well in trusted environments. But 
many resource providers would rather enforce restrictions 
than assume applications behave well. Indeed, in 2006, 
PlanetLab began enforcing average daily bandwidth limits 
per node per service (i.e., per PlanetLab “sliver”). When
a sliver hits 80% of its limit—17.2 GB/day from each 
server to the public Internet—the kernel begins enforcing 
bandwidth caps (using Linux’s Hierarchical Token Bucket 
scheduler) as calculated over five-minute epochs. 

We now have the possibility of two levels of bandwidth 
management: admission control by CoralCDN proxies and 
rate limiting by the underlying hosting platform. Interest- 
ingly, even though CoralCDN uses a relatively conserva- 
tive limit for itself (10 GB/day per sliver), it still surpasses 
the 80% mark (13.8 GB) on 5-10 servers per day (out of its
300-400 servers). The main cause of this overage is that, 
while CoralCDN counts only successful HTTP responses, 
its hosting platform accounts for all traffic—HTTP, DNS, 
DHT RPCs, log transfers, packet headers, retransmissions, 
etc.—generated by its sliver. Figure 16 shows the differ- 
ence in these recorded values for the week of Sept 16, 
2009. We see that kernel statistics were 50%-90% higher
than CoralCDN’s accounting. This problem of accurate
accounting is a general one, as it is difficult or expensive
to collect such data in user-space.¹² And even accurate
information does not prevent CoralCDN’s managed HTTP
traffic from competing for network resources with the rest 
of its sliver’s unmanaged traffic. 
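The arithmetic behind this overage can be made concrete with a short sketch. The three constants come from the text; the function names are hypothetical, and modeling the kernel's count as the application's count scaled by a single overhead factor is a simplifying assumption.

```python
# Figures quoted in the text (GB per day, per sliver).
PLATFORM_LIMIT_GB = 17.2      # PlanetLab's daily limit
ENFORCEMENT_FRACTION = 0.80   # caps begin at 80% of the limit
CORAL_SELF_LIMIT_GB = 10.0    # CoralCDN's own, more conservative limit


def enforcement_mark_gb():
    """Volume at which the kernel starts enforcing bandwidth caps."""
    return PLATFORM_LIMIT_GB * ENFORCEMENT_FRACTION


def kernel_counted_gb(app_counted_gb, overhead):
    """Assumed model: the kernel sees the application's count plus a fractional overhead."""
    return app_counted_gb * (1.0 + overhead)


def trips_platform_cap(app_counted_gb, overhead):
    """True if traffic the application believes is within budget still hits the 80% mark."""
    return kernel_counted_gb(app_counted_gb, overhead) > enforcement_mark_gb()


def break_even_overhead(app_counted_gb):
    """Unaccounted-traffic fraction at which the 80% mark is reached exactly."""
    return enforcement_mark_gb() / app_counted_gb - 1.0
```

Under this model, a proxy that serves its full 10 GB of counted HTTP responses reaches the 13.76 GB mark once unaccounted traffic exceeds roughly 38% of the counted volume, which is below the 50%-90% gap reported for the measured week.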


We argue that hosting platforms should provide better 
visibility and control. First, these platforms should export 
greater information to higher levels, such as their current 
measured resource consumption in a machine-readable 
format and in real time. Second, these platforms should 
allow applications to push policies into lower levels, i.e., 
an application’s explicit preferences for handling differ- 
ent classes of resources. For the specific case of network 
resources, the platform kernel could apply priorities on a
granularity finer than just per-sliver, akin to a form of
end-host DiffServ; CoralCDN would prioritize DNS and DHT
traffic over HTTP traffic, and HTTP traffic in turn over log maintenance.
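To illustrate the kind of per-class preference an application might push down to the platform, the following is a minimal strict-priority queue. It is a user-space sketch only: the class names and the `PriorityScheduler` type are hypothetical, and no real kernel or PlanetLab interface is implied.

```python
import heapq
from itertools import count

# Hypothetical traffic classes, ordered as the text suggests:
# DNS and DHT first, then HTTP, then log maintenance (lower number = served first).
PRIORITY = {"dns": 0, "dht": 0, "http": 1, "log": 2}


class PriorityScheduler:
    """Strict-priority queue; FIFO among packets of equal priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def enqueue(self, traffic_class, packet):
        entry = (PRIORITY[traffic_class], next(self._seq), traffic_class, packet)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        """Return (traffic_class, packet) for the next packet, or None if idle."""
        if not self._heap:
            return None
        _, _, traffic_class, packet = heapq.heappop(self._heap)
        return traffic_class, packet
```

Strict priority is the simplest possible policy; a real platform would more likely combine such classes with rate limits so that low-priority traffic is not starved.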


Note that we are concerned with a different type of re- 
source management than that provided by VM hypervisors 
or kernel resource containers [4]. Those systems focus 
on short-term resource isolation or prioritized scheduling 
between applications, and typically reason about coarse- 
grain VM-level resources. Our focus instead is on long- 
term resource accounting. PlanetLab is not unique here; 
commercial cloud-computing providers such as Amazon 
and Rackspace use long-term resource accounting for 
billing purposes. (In fact, Amazon just launched its Cloud- 
Watch service in June 2009 to expose real-time resource 
monitoring on a coarser-grain per-VM basis [1].) Thus, 
providing greater visibility and control would be useful 
not only for deploying applications on platforms with hard 
constraints (e.g., PlanetLab), but also for managing appli- 
cations on commercial platforms so as to minimize costs 
(e.g., in both metered and 95th-%ile billing scenarios). 


6.2 Providing support for fault-tolerance 


A central reliability issue in CoralCDN is due to its boot- 
strapping problem: To initially resolve a Coralized URL 
with no prior knowledge of system participants, a client’s 
resolver must contact one of only 10-12 CoralCDN name- 
servers registered with the .net gTLD servers. If one 
of these nameservers fails—each IP address represents 
a static PlanetLab server—clients experience long DNS 
timeouts. Thus, while CoralCDN internally detects and 
reacts quickly to failure, the same rapid recovery is not 
enjoyed by its primary nameservers registered externally. 
And once legacy clients bind to a particular proxy’s IP
address (e.g., web browsers cache name-to-IP mappings
to prevent certain types of “rebinding” attacks on the
Same Origin Policy [9]), CoralCDN cannot recover for
this client if that proxy fails.

¹²In fact, even Akamai servers only use an estimate of bandwidth
consumption (their so-called “fully-weighted bits”) when calculating
server load [22]. Only more recently did PlanetLab expose kernel accounting.


While certainly observed before, CoralCDN’s reliabil- 
ity challenge underscores the limits of purely application- 
layer recovery, especially as it relates to bootstrapping. In 
the context of DNS-based bootstrapping, several possibil- 
ities exist, including (1) dynamically updating root name- 
servers to reflect changes, e.g., via the rarely-supported 
RFC2136 [36], (2) announcing IP anycast addresses via 
BGP or OSPF, or (3) using transparent network-layer 
failover between colocated nameservers (e.g., ARP spoofing
or VIP/DIP load balancers). IP-level recovery between
proxies has its own solutions, but these most commonly rely on
colocated servers in LAN environments. None of these
suggestions are new ones, but they still present a higher 
barrier to entry; PlanetLab did not have any available to it. 


Deployment platforms should strive to provide or ex- 
pose such network functionality to their services. Ama- 
zon EC2’s launch of Elastic IP Addresses in March 2008, 
for example, hid the complexity of ARP spoofing for VM 
environments. The further development of such support 
should be an explicit goal for future deployment platforms. 


7 Conclusions and Looking Forward 


Our retrospective on CoralCDN’s deployment has a rather 
mixed message. We view the adoption of CoralCDN as 
a successful proof-of-concept of how users can and will 
leverage open APIs for CDN services. But many of its ar- 
chitectural features were over-designed for its current en- 
vironment and with its current workload: A much sim- 
pler design could have sufficed with probably better per- 
formance to boot. 


That said, it is an entirely different question as to whether
CoralCDN provides a good basis for designing an Internet- 
scale cooperative CDN. The service remained tied to Plan- 
etLab because we desired a solution that was backwards 
compatible with both unmodified clients and servers. Run- 
ning on untrusted nodes seemed imprudent at best given 
our inability to provide end-to-end security checks. We 
have shown, however, that even running CoralCDN on 
fully trusted nodes introduces some security concerns. So, 
if we dropped the goal of full backwards compatibility, 
what minimal changes could better support more open, 
flexible infrastructure? 


Naming. CoralCDN’s naming provided a layer of in- 
direction for composing two loosely-coupled Internet ser- 
vices. In fact, one could compose longer series of services 
that each offer different functionality by simply chaining 
together their domain names. While this technique would 
not be safe under today’s Same Origin Policy, we showed 
in §4.3 how a trusted proxy could constrain the default
security policy. For a participating origin server with an
untrusted CDN, the origin should specify (and sign) its
minimally required domain suffix of origin.tld.x.

Figure 17: Percentage of a proxy’s upstream requests satisfied by
origin and by peers at various clustering levels when regional
cooperation is used, i.e., level-0 peers only serve as a failover from a faulty
origin. Dataset covers 10-day period from December 9-19, 2009.

Figure 18: Change in percentage between regional cooperation
policy (Figure 17) and CoralCDN’s traditional global peering. Positive
values correspond to increased hit rates in regional peering.


Content Integrity. Today’s CDNs are full-fledged members
of a website’s trusted computing base. They have free
rein to return modified content. Often, they can even
programmatically read and modify any content served directly
from a customer website to its clients (either by serving 
embedded <script>’s or by playing SOP tricks while 
masquerading as their customer behind a DNS alias). To 
provide content delivery via untrusted nodes, the natural 
solution is an HTTP protocol that supports end-to-end sig- 
natures for content integrity [30]. In fact, even a browser 
extension would suffice to deploy such security [35]. 


Fine-Grain Origin Control. A tension in this paper
is between client latency and server load, underscored by 
our varied usage scenarios. An appropriate strategy for 
interacting with a well-provisioned server is a minimal at- 
tempt at cooperation before contacting the origin. Yet, an 
oversubscribed server wants its clients to make a maximal 
effort at cooperation. So far, proxies have used a “one- 
size-fits-all” approach, treating all origins as if they were 
oversubscribed. Instead, much as they have adopted
dynamic URL rewriting, origin domains can signal a
CoralCDN proxy about their desired policy in-band. At a high
level, this argues for a richer API for elastic CDN services.

To explore the effect of regional cooperation, we have
changed the default lookup policy on about half the
deployed CoralCDN proxies since September 2009. If
requested content is not already cached locally, these proxies
only perform lookups within local and regional clusters
(level 2 and 1) before contacting the origin. For proxies
operating under such a policy, Figure 17 shows the
percentage of upstream requests that were satisfied by the
origin and at different levels of clusters. Figure 18
depicts the change in behavior compared to the traditional
global lookup strategy, showing that the 10-12% of
requests that had been satisfied by level-0 proxies shifted to
higher hit rates at both the origin and local proxies.¹³ This
change was associated with an order-of-magnitude latency
improvement for the Coral lookup, shown in Figure 19.
The global index still provides some benefit to the system,
however: as per Figure 17, it satisfies an average of 0.56%
of requests (stddev 0.51%) that failed over from origin
servers. In summary, system architectures like CoralCDN
can support different policies that trade off server load for
latency, yet still mask temporary failures at origins.

Figure 19: [...] lookup latency (over all hour-long epochs of Dec 9-19, 2009),
comparing regional and global cooperation policies. Individual lookups
were configured with a five-second timeout.
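The difference between the regional and global lookup policies can be sketched as a simple resolution order. This is an illustration under assumptions rather than CoralCDN's implementation: `lookup` and `origin` are hypothetical callables standing in for a DHT lookup at a given cluster level and for an origin fetch, and each returns `None` on a miss or failure.

```python
from typing import Callable, Dict, Optional, Tuple

Lookup = Callable[[int, str], Optional[bytes]]   # (cluster level, url) -> content or None
Origin = Callable[[str], Optional[bytes]]        # url -> content, or None if origin fails


def resolve(url: str, cache: Dict[str, bytes], lookup: Lookup,
            origin: Origin, regional: bool) -> Tuple[str, Optional[bytes]]:
    """Return (source, content) following the regional or the global policy."""
    if url in cache:
        return "cache", cache[url]

    # Regional policy: only local (level 2) and regional (level 1) clusters
    # are consulted before the origin. Global policy also consults level 0.
    levels = (2, 1) if regional else (2, 1, 0)
    for level in levels:
        content = lookup(level, url)
        if content is not None:
            return f"level-{level}", content

    content = origin(url)
    if content is not None:
        return "origin", content

    # Under the regional policy, level-0 peers serve only as a failover
    # when the origin cannot be reached.
    if regional:
        content = lookup(0, url)
        if content is not None:
            return "level-0", content
    return "miss", None
```

The regional order trades some extra origin load for lower lookup latency, while the final level-0 step retains the ability to mask a temporarily failed origin.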


While perhaps imperfectly suited for a smaller-scale 
platform like PlanetLab, CoralCDN’s architecture pro- 
vides interesting self-organizational and hierarchical prop- 
erties. This paper discussed many of the challenges—in 
security, availability, fault-tolerance, robustness, and, per- 
haps most significantly, resource management—that we 
needed to address during its five-year deployment. We 
believe that its lessons may have wider and more lasting 
implications for other systems as well.

¹³These graphs also show interesting diurnal patterns, related to a
default expiry time of 12 hours for content.

Abstract


Whanau is a novel routing protocol for distributed 
hash tables (DHTs) that is efficient and strongly resis- 
tant to the Sybil attack. Whanau uses the social connec- 
tions between users to build routing tables that enable 
Sybil-resistant lookups. The number of Sybils in the so- 
cial network does not affect the protocol’s performance, 
but links between honest users and Sybils do. When there 
are n well-connected honest nodes, Whanau can tolerate 
up to O(n/ log n) such “attack edges”. This means that 
an adversary must convince a large fraction of the honest 
users to make a social connection with the adversary’s 
Sybils before any lookups will fail. 

Whanau uses ideas from structured DHTs to build 
routing tables that contain O(√n log n) entries per node.
It introduces the idea of layered identifiers to counter 
clustering attacks, a class of Sybil attacks challenging for 
previous DHTs to handle. Using the constructed tables, 
lookups provably take constant time. Simulation results, 
using social network graphs from LiveJournal, Flickr, 
YouTube, and DBLP, confirm the analytic results. Ex- 
perimental results on PlanetLab confirm that the protocol 
can handle modest churn. 


1 Introduction 


Decentralized systems on the Internet are vulnerable to 
the “Sybil attack”, in which an adversary creates numer- 
ous false identities to influence the system’s behavior [9]. 
This problem is particularly pernicious when the system 
is responsible for routing messages amongst nodes, as in 
the Distributed Hash Tables (DHT) [24] which underlie 
many peer-to-peer systems, because an attacker can pre- 
vent honest nodes from communicating altogether [23]. 

If a central authority certifies identities as genuine, 
then standard replication techniques can be used to fortify
these protocols [4, 20]. However, the cost of universal
strong identities may be prohibitive or impractical. Instead,
recent work [27, 26, 8, 19, 17, 5] proposes using the
weak identity information inherent in a social network 
to produce a completely decentralized system. This pa- 
per resolves an open problem by demonstrating an effi- 
cient, structured DHT that enables honest nodes to reli- 
ably communicate despite a concerted Sybil attack. 

To solve this problem, we build on a social network 
composed of individual trust relations between honest
people (nodes). This network might come from personal
or business connections, or it might correspond to some- 
thing more abstract, such as ISP peering relationships. 
We presume that each participant keeps track of its im- 
mediate neighbors, but that there is no central trusted 
node storing a map of the network. 

An adversary can infiltrate the network by creating 
many Sybil nodes (phoney identities) and gaining the 
trust of honest people. Nodes cannot directly distin- 
guish Sybil identities from genuine ones (if they could, 
it would be simple to reject Sybils). As in previous 
work [27], we assume that most honest nodes have more 
social connections to other honest nodes than to Sybils; 
in other words, the network has a sparse cut between the 
honest nodes and the Sybil nodes. 

In the context of a DHT, the adversary cannot pre- 
vent immediate neighbors from communicating, but can 
try to disrupt the DHT by creating many Sybil identities 
which spread misinformation. Suppose an honest node u 
wants to find a key y and will recognize the corresponding
value (e.g., a signed data block). In a typical structured
DHT, u queries another node which u believes to
be “closer” to y, which forwards to another even-closer 
node, and so on until y is found. The Sybil nodes can 
disrupt this process by spreading false information (e.g., 
that they are close to a particular key), then intercept- 
ing honest nodes’ routing queries, and responding with 
“no such key” or delaying queries endlessly. Unstruc- 
tured protocols that work by flooding or gossip are more 
robust against these attacks, but pay a performance price, 
requiring linear time to find a key. 

This paper’s main contribution is Whanau,¹ a novel
protocol that is the first solution to Sybil-proof routing
that has sublinear run time and space usage. Whanau
achieves this performance by combining the idea of random
walks from recent work [26] with a new way of
constructing IDs, which we call layered identifiers. To
store up to k keys per node, Whanau builds routing tables
with O(√(kn) log n) entries per node. Using these
routing tables, lookups provably take O(1) time. Thus, 
Whanau’s security comes at low cost: it scales similarly 
to one-hop DHTs that provide no security [11, 10]. We 
have implemented Whanau in simulation and in a simple
instant-messaging application running on PlanetLab [2].
Experiments with real-world social graphs and these
implementations confirm Whanau’s theoretical properties.

¹Whanau, pronounced “far-no”, is a Maori word. It is cognate with
the Hawai’ian word ’ohana, and means “extended family” or “kin”.

Whanau provides one-hop lookups, but our implemen- 
tation is not aware of network locality. Whanau also must 
rebuild its routing tables periodically to handle churn in 
the social network and in the set of keys stored in the 
DHT. However, its routing tables are sufficiently redun- 
dant that nodes simply going up and down doesn’t impact 
lookups, as long as enough honest nodes remain online. 

The rest of the paper is organized as follows. Section 2 
summarizes the related work. Section 3 informally states 
our goals. Section 4 states our assumptions about the so- 
cial network, and provides a precise definition of “Sybil- 
proof”. Section 5 gives an overview of Whanau’s routing 
table structure and introduces layered IDs. Section 6 de- 
scribes Whanau’s setup and lookup procedures in detail. 
Section 7 states lemmas proving Whanau’s good perfor- 
mance. Section 8 describes Whanau’s implementation on 
PlanetLab [2] and in a simulator. Using these implementations,
Section 9 confirms its theoretical properties by
simulations on social network graphs from popular In- 
ternet services, and investigates its reaction to churn on 
PlanetLab. Section 10 discusses engineering details and 
ideas for future work, and Section 11 summarizes. 


2 Related work 


Shortly after the introduction of scalable peer-to-peer 
systems based on DHTs, the Sybil attack was recognized 
as a challenging security problem [9, 16, 23, 22]. A number
of techniques [4, 20, 22] have been proposed to make
DHTs resistant to a small fraction of Sybil nodes, but all 
such systems ultimately rely on a certifying authority to 
perform admission control and limit the number of Sybil 
identities [9, 21, 3].

Several researchers [17, 19, 8, 5] proposed using social
network information to fortify peer-to-peer systems
against the Sybil attack. The bootstrap graph model [8] 
introduced a correctness criterion for secure routing us- 
ing a social network and presented preliminary progress 
towards that goal, but left a robust and efficient protocol 
as an open problem. 

Recently, SybilGuard and SybilLimit [27, 26] have 
shown how to use a “fast mixing” social network and 
random walks on these networks (see Section 4.1) to de- 
fend against the Sybil attack in general decentralized sys- 
tems. Using SybilLimit, an honest node can certify other 
nodes as “probably honest”, accepting at most O(log n) 
Sybil identities per attack edge. (Each certification uses 
O(√n) bandwidth.) For example, SybilLimit’s vetting
procedure can be used to check that at least one of a set 
of storage replicas is likely to be honest. 

A few papers have adapted the idea of random walks 
for purposes other than SybilLimit. Nguyen et al. used it
for Sybil-resilient content rating [25], Yu et al. applied it
to recommendations [28], and Danezis and Mittal used
it for Bayesian inference of Sybils [7]. This paper is the
first to use random walks to build a Sybil-proof DHT.²

Figure 1: Overview of Whanau. SETUP builds structured routing
tables which LOOKUP uses to route queries to keys.


3 Goals 


As illustrated in Figure 1, the Whanau protocol is a pair 
of procedures SETUP and LOOKUP. SETUP(·) is used
both to build routing tables and to insert keys. It cooper- 
atively transforms the nodes’ local parameters (e.g. key- 
value records, social neighbors) into a set of routing table 
structures stored at each node. After all nodes complete 
the SETUP phase, any node u can call LOOKUP(u, key) 
to use these routing tables to find the target value. 


3.1 Scenario 


We illustrate Whanau with a simple instant messaging 
(IM) application which we have implemented on Planet- 
Lab. Whanau provides a rendezvous service for the IM 
clients. Each user is identified by a public key, and pub- 
lishes a single self-signed tuple (public key, current IP 
address) into the DHT.³ To send an IM to a buddy identified
by the public key PK, a client looks up PK in the
DHT, verifies the returned tuple’s signature using PK, 
and then sends a packet to that IP address. 
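The rendezvous step above can be sketched as follows. The names `Record`, `find_buddy`, `lookup`, and `verify` are hypothetical and not part of the authors' implementation: `lookup` stands in for the DHT's LOOKUP procedure and is assumed to return zero or more candidate records, and `verify` stands in for a real public-key signature check supplied by a cryptographic library.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Record(NamedTuple):
    """Self-signed tuple published into the DHT (hypothetical layout)."""
    public_key: bytes
    ip_address: str
    signature: bytes


# (public key, signed message, signature) -> True if the signature is valid.
Verify = Callable[[bytes, bytes, bytes], bool]


def find_buddy(pk: bytes,
               lookup: Callable[[bytes], Iterable[Record]],
               verify: Verify) -> Optional[str]:
    """Look up PK and return the first IP address whose record verifies under PK."""
    for record in lookup(pk):
        message = record.public_key + record.ip_address.encode()
        if record.public_key == pk and verify(pk, message, record.signature):
            return record.ip_address
    return None  # no validly signed record was found
```

Because the client verifies each returned tuple itself, a record forged by another party is simply skipped; the DHT only has to make the genuine record available.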

In our implementation, each user runs a Whanau node 
which stores that user’s record, maintains contact with 
the user’s social neighbors, and contributes to the DHT. 
(In this example, each node stores a single key-value 
record, but in general, there may be an arbitrary number 
k of keys stored per node.) When the user changes loca- 
tion, the client updates the user’s record with the new IP 
address. The user’s DHT node need not be continuously 
available when the user is offline, as long as a substantial 
fraction of honest nodes are available at any given time. 


3.2 Security goal 


Whanau handles adversaries who deviate from the proto- 
col in Byzantine ways: the adversaries may make up ar- 
bitrary responses to queries from honest nodes and may 
create any number of pseudonyms (Sybils) which are
indistinguishable from honest nodes. When we say that
Whanau is “Sybil-proof”, we mean that LOOKUP has a
high probability of returning the correct value, despite
arbitrary attacks during both the SETUP and LOOKUP
phases. (Section 4 makes this definition more precise.)

The adversary can always join the DHT normally
and insert arbitrary key-value pairs, including a different
value for a key already in the DHT. Thus, Whanau
provides availability, but not integrity: LOOKUP should
find all values inserted by honest nodes for the specified
key, but may also return some values inserted by
the adversary. Integrity is an orthogonal concern of the
application: for example, the IM application filters out
any bad values by verifying the signature on the returned
key-value records, and ignoring records with invalid
signatures. (As an optimization, DHT nodes may opt to
discard bad records proactively, since they are of no use to
any client and consume resources to store and transmit.)

²Our workshop paper [14] noted the opportunity and sketched an
early precursor to Whanau.

³Of course, a realistic application would require a PKI for human-
readable names, protection against replays, privacy controls, and so on.


3.3 Performance goals


Simply flooding LOOKUP queries over all links of the 
social network is Sybil-resistant, but not efficient [8]. 
The adversary’s nodes might refuse to forward queries, 
or they might reply with bogus values. However, if there 
exists any path of honest nodes between the source node 
and the target key’s node through the social network, then 
the adversary cannot prevent each of these nodes from 
forwarding the query to the next. In this way, the query 
will always reach the target node, which will reply with 
the correct value. Unfortunately, a large fraction of the 
participating nodes are contacted for every lookup, do- 
ing O(n) work each time. 
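A minimal sketch of the flooding baseline makes both properties visible: the query succeeds whenever an all-honest path exists, and the number of nodes contacted grows with the size of the network. The graph representation (an adjacency dictionary), the `store` mapping, and the function name are assumptions made for illustration; Sybil nodes are modeled in the worst case as silently dropping every query.

```python
from collections import deque


def flood_lookup(graph, source, key, store, sybils):
    """Breadth-first flood from `source`; returns (value or None, nodes contacted).

    graph:  dict mapping node -> list of neighbor nodes (undirected social links)
    store:  dict mapping node -> dict of key -> value held by that node
    sybils: set of adversarial nodes, assumed here to drop all queries
    """
    visited = {source}
    queue = deque([source])
    contacted = 0
    while queue:
        node = queue.popleft()
        contacted += 1
        if node in sybils:
            continue  # adversary neither answers correctly nor forwards
        if key in store.get(node, {}):
            return store[node][key], contacted
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return None, contacted
```

In the worst case every reachable node is contacted before the key is found, which is the linear cost the text refers to.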

On the other hand, known one-hop DHTs are very 
efficient, requiring O(1) messages for lookups and
O(√n) table sizes,⁴ but not secure against the Sybil
attack. Our goal is to combine this optimal efficiency 
with provable security. As a matter of policy and fair- 
ness, we believe that a node’s table size and bandwidth 
consumption should be proportional to the node’s degree 
(i.e., highly connected nodes should do more work than
casual participants). While it is possible to adapt Wha- 
nau to different policies, this paper assumes that the goal 
is a proportional policy. 


4 Defining “Sybil-proof”


Like previous work [27, 26], Whanau relies on certain 
features of social networks. This section describes our 
assumptions, outlines why they are useful, and defines 
what it means for a DHT to be “Sybil-proof” under these 
assumptions. 


⁴If n = 5 × 10⁸, the approximate number of Internet hosts in
2010, then a table of √n entries may be acceptable for bandwidth- and
storage-constrained devices, as opposed to a table that scales linearly
with the number of hosts.





Figure 2: The social network. A sparse cut (the dashed attack 
edges) separates the honest nodes from the Sybil nodes. The 
Sybil region’s size is not well-defined, since the adversary can 
create new pseudonyms at will. 


4.1 Fast-mixing social networks 


The social network is an undirected graph whose nodes 
know their immediate neighbors. Figure 2 conceptually 
divides the social network into two parts, an honest re- 
gion containing all honest nodes and a Sybil region con- 
taining all Sybil identities. An attack edge is a connec- 
tion between a Sybil node and an honest node. An honest 
edge is a connection between two honest nodes [27]. An 
“honest” node whose software’s integrity has been com- 
promised by the adversary is considered a Sybil node. 


The key assumption is that the number of attack edges, 
g, is small relative to the number of honest nodes, n. As
pointed out by earlier work, one can justify this sparse 
cut assumption by observing that, unlike creating a Sybil 
identity, creating an attack edge requires the adversary 
to expend social-engineering effort: the adversary must 
convince an honest person to create a social link to one 
of his Sybil identities. 


Whanau’s correctness will depend on the sparse cut 
assumption, but its performance will not depend at all 
on the number of Sybils. In fact, the protocol is totally 
oblivious to the structure of the Sybil region. Therefore, 
the classic Sybil attack, of creating many fake identities 
to swamp the honest identities, is ineffective. 


Since we rely on the existence of a sparse cut to dis- 
tinguish the honest region from the Sybil region, we also 
assume that there is no sparse cut dividing the honest 
region in two. Given this assumption, the honest region 
forms an expander graph. Expander graphs are fast mix- 
ing, which means that a short random walk starting from 
any node will quickly approach the stationary distribu- 
tion [6]. Roughly speaking, the ending node of a random 
walk is a random node in the network, with a probability 
distribution proportional to the node’s degree. The mix- 
ing time, w, is the number of steps a random walk must 
take to reach this smooth distribution. For a fast mixing 
network, w = O(logn). Section 9.1 shows that graphs 
extracted from some real social networks are fast mixing.

Parameter  Typical magnitude  Description
n          arbitrary n > 1    number of honest nodes
m          O(n)               number of honest edges
w          O(log n)           mixing time of honest region
k          arbitrary k > 1/m  keys stored per (virtual) node
g          O(n/w)             number of attack edges
ε          O(gw/n)            fraction of loser nodes


Table 1: Social network parameters used in our analysis. 


4.2 Sampling by random walk 


The random walk is Whanau’s main building block, and 
is the only way the protocol uses the social network. An 
honest node can send out a w-step walk to sample a ran- 
dom node from the social network. If it sends out a large 
number of such walks, and the social network is fast mixing
and has a sparse cut separating the honest nodes and
Sybil nodes, then the resulting set will contain a large 
fraction of random honest nodes and a small number of 
Sybil nodes [26]. Because the initiating node cannot tell 
which individual samples are good and which are bad, 
Whanau treats all sampled nodes equally, relying only 
on the fact that a large fraction will be good nodes. 
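The sampling primitive can be sketched in a few lines. This is an illustrative simulation under assumptions, not the protocol's networked implementation: the social graph is a local adjacency dictionary, the function names are hypothetical, and in a real deployment each step would be a message forwarded to a randomly chosen neighbor.

```python
import random


def random_walk(graph, start, steps, rng):
    """Take `steps` uniformly random hops from `start` and return the final node."""
    node = start
    for _ in range(steps):
        node = rng.choice(graph[node])
    return node


def sample_nodes(graph, start, steps, num_walks, rng):
    """Send out `num_walks` independent w-step walks and collect their endpoints.

    The caller cannot tell honest endpoints from Sybil ones, so all samples
    are returned and treated equally.
    """
    return [random_walk(graph, start, steps, rng) for _ in range(num_walks)]
```

A call such as `sample_nodes(graph, u, w, r, random.Random())` would return r samples for node u; choosing the walk length w close to the mixing time is what makes the honest samples approximately random.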

Some honest nodes may be near a concentration of at- 
tack edges. Such loser nodes have been lax about ensur- 
ing that their social connections are real people, and their 
view of the social graph does not contain much informa- 
tion. Random walks starting from loser nodes are more 
likely to escape into the Sybil region. As a consequence, 
loser nodes must do more work per lookup than winner 
nodes, since the adversary can force them to waste re- 
sources. Luckily, only a small fraction of honest nodes 
are losers, because a higher concentration of attack edges 
in one part of the network means a lower concentration 
elsewhere. Most honest nodes will be winner nodes. 

In the stationary distribution, proportionally more ran- 
dom walks will land on high-degree nodes than low- 
degree nodes. To handle high-degree nodes well, each 
Whanau participant creates one virtual node [24] per 
social network edge. Thus, good random samples are 
distributed uniformly over the virtual nodes. All virtual 
nodes contribute equal resources to the DHT and obtain 
equal levels of service (i.e., keys stored/queried). This
use of virtual nodes fulfils the policy goal (Section 3.3) 
of allocating both workload and trust according to each 
person’s level of participation in the social network. 


4.3 Main security definition


Table 1 summarizes the social network parameters intro- 
duced thus far. We can now succinctly define our main 
security property: 


Definition. A DHT protocol is (g, ε, p)-Sybil-proof if,
against an active adversary with up to g attack edges, the
protocol’s LOOKUP procedure succeeds with probability
≥ p on any honest key, for at least (1 − ε)n honest nodes.

Given a (g, ε, 1/2)-Sybil-proof protocol, it is always
possible to amplify the probability of success p exponentially
close to 1 by, for example, running multiple independent
instances of the protocol in parallel.⁵ For example,
running 3 log_2 n instances would reduce the failure
probability to less than 1/n^3, essentially guaranteeing
that all lookups will succeed with high probability (since
there are only n^2 possible source-target node pairs).
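
The amplification arithmetic can be checked with a few lines of Python. This is only a worked example: the function names and the choice n = 1024 are illustrative assumptions, and the instances are assumed to fail independently.

```python
import math

def parallel_instances(n):
    """Number of independent instances suggested in the text: 3 * log2(n)."""
    return math.ceil(3 * math.log2(n))

def failure_probability(per_instance_success, instances):
    """Probability that every one of the independent instances fails."""
    return (1.0 - per_instance_success) ** instances

n = 1024
k = parallel_instances(n)          # 3 * log2(1024) instances
fail = failure_probability(0.5, k) # (1/2)^k
```

For a per-instance success probability of exactly 1/2 the failure probability equals 1/n^3; any success probability above 1/2 pushes it strictly below that value.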

The parameter ε represents the fraction of loser nodes,
which is a function of the distribution of attack edges
in the network. If attack edges are distributed uniformly,
then ε may be zero; if attack edges are clustered, then a
small fraction of nodes may be losers.

We use the parameters in Table 1 to analyze our protocol,
but do not assume that all of them are known by the
honest participants. Whanau needs order-of-magnitude
estimates of m, w, and k to choose appropriate table sizes
and walk lengths. It does not need to know g or ε.

Proving that a protocol is Sybil-proof doesn’t imply 
that it cannot be broken. For example, Whanau is Sybil- 
proof but can be broken by social engineering attacks 
that invalidate the assumption that there is a sparse cut 
between the honest and Sybil regions. Similarly, a pro- 
tocol may be broken by using cryptographic attacks or 
attacks on the underlying network infrastructure. These 
are serious concerns, but these are not the Sybil attack as 
described by Douceur [9]. Whanau’s novel contribution 
is that it is the first DHT protocol totally insensitive to 
the number of Sybil identities. 


5 Overview of Whanau 


This section outlines Whanau’s main characteristics. 


5.1 Challenge 


The Sybil attack poses three main challenges for a struc- 
tured DHT. First, structured DHTs forward queries us- 
ing small routing tables at each node. Simply by creating 
many cheap pseudonyms, an attacker will occupy many 
of these table entries and can disrupt queries [23]. 

Second, a new DHT node builds and maintains its 
routing tables by querying its neighbors’ tables. An at- 
tacker can reply to these queries with only its own nodes. 
Over time, this increases the fraction of table entries the 
attacker occupies [22]. 

Third, DHTs assign random IDs to nodes and apply
hash functions to keys in order to spread load evenly. By
applying repeated guess-and-check, a Sybil attacker can
choose its own IDs and bypass these mechanisms. This
enables clustering attacks targeted at a specific key. For
example, if the adversary inserts many keys near the
targeted key, then it might overflow the tables of honest
nodes responsible for storing that part of the key space.
Alternatively, the adversary might choose all its IDs to
fall near the targeted key. Then, honest nodes might have
to send many useless query messages to Sybil nodes
before eventually querying an honest node.

⁵ For Whanau, it turns out to be more efficient to increase the routing
table size instead of running multiple parallel instances.


5.2 Strawman protocol 


To illustrate how random walks apply to the problem of
Sybil-proof DHT routing, consider the following strawman
protocol. In the setup phase, each node initiates
r = O(√(km)) independent length-w random walks on
the social network. It collects a random key-value record
from the final node of each walk, and stores these nodes
and records in a local table.

To perform a lookup, a node u consults its local record
table. If the key is not in this table (which is likely), u
broadcasts the key to the O(√(km)) nodes v_1, ..., v_r in
its table. If those nodes’ tables are sufficiently large, with
high probability, at least one node v_i will have the needed
key-value record in its local table.
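
A minimal sketch of the strawman follows. It assumes an attack-free network in which uniform sampling over nodes stands in for the length-w random walks, one record per node, and a table size r chosen generously rather than tuned to √(km); all names are hypothetical.

```python
import random

def strawman_setup(records, r, rng):
    """Each node stores r sampled (node, record) pairs.

    Assumption: uniform sampling over nodes stands in for the
    length-w random walks on a fast-mixing, Sybil-free graph.
    """
    nodes = sorted(records)
    tables = {}
    for u in nodes:
        sampled = [rng.choice(nodes) for _ in range(r)]
        tables[u] = [(v, records[v]) for v in sampled]
    return tables

def strawman_lookup(tables, u, key):
    """Check u's own table, then ask every node listed in it."""
    def local(node):
        for _, (k, value) in tables[node]:
            if k == key:
                return value
        return None

    found = local(u)
    if found is not None:
        return found
    for v, _ in tables[u]:  # the "broadcast" step: one message per table entry
        found = local(v)
        if found is not None:
            return found
    return None
```

Every lookup that misses locally costs one message per table entry, which is the bandwidth problem the structured design is meant to remove.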

The strawman protocol shows how random walks ad- 
dress the first and second challenges above. If the number 
of attack edges is small, most random walks stay within 
the honest region. Thus, the local tables contain mostly 
honest nodes and records. Furthermore, nodes use only 
random walks to build their tables: they never look at 
each other’s tables during the setup process. As a result, 
the adversary’s influence does not increase over time. 

The strawman sidesteps the third challenge by es- 
chewing node IDs entirely, but this limits its efficiency. 
Lookups are “one-hop” in the sense that the ideal lookup 
latency is a single network round-trip. However, since 
each lookup sends a large number of messages, perfor- 
mance will become limited by network bandwidth and 
CPU load as the network size scales up. By adding struc- 
ture, we can improve performance. The main challenge 
is to craft the structure in such a way that it cannot be 
exploited by a clustering attack. 


5.3 Whanau’s global structure


Whanau’s structure resembles other DHTs such as
Chord [24], SkipNet [12], and Kelips [11]. Like SkipNet
and Chord, Whanau assumes a given, global, circular
ordering ≺ on keys (e.g., lexical ordering). The notation
x_1 ≺ x_2 ≺ ··· ≺ x_z means that for any indexes
i < j < k, the key x_j is on the arc (x_i, x_k).


No metric space. Like SkipNet, but unlike Chord and 
many other DHTs, Whanau does not embed the keys 
into a metric space using a hash function. If Whanau 
were to use a hash function to map keys into a metric 
space, an adversary could use guess-and-check to con- 
struct many keys that fall between any two neighboring 
honest keys. This would warp the distribution of keys in 
the system and defeat the purpose of the hash function. 
Therefore, Whanau has no a priori notion of “distance”
between two keys; it can determine only if one key falls
between two other keys. This simple ordering provides 
some structure (e.g., a node can have a successor table), 
but still requires defenses to clustering attacks. 


Fingers and successors. Most structured DHTs have 
routing tables with both “far pointers”, sometimes called 
fingers, and “near pointers”, called leaves or successors. 
Whanau follows this pattern. All nodes have layered IDs
(described below) which are of the same data type as the
keys. Each node’s finger table contains O(√(km)) pointers
to other nodes with IDs spaced evenly over the key
space. Likewise, each node’s successor table contains
the O(√(km)) honest key-value records immediately following
its ID. Finger tables are constructed simply by
sending out O(√(km)) random walks, collecting a random
sample of (honest and Sybil) nodes along with their
layered IDs. Successor tables are built using a more complex
sampling procedure (described in Section 6.1).

Together, an honest node’s finger nodes’ successor ta- 
bles cover the entire set of honest keys, with high proba- 
bility. This structure enables fast one-hop lookups: sim- 
ply send a query message to a finger node preceding 
the target key. The chosen finger is likely to have the 
needed record in its successor table. (If not, a few retries 
with different fingers should suffice.) In contrast with the 
strawman protocol above, this approach uses a constant 
number of messages on average, and O(log n) messages 
(which may be sent in parallel) in the worst case. 


Layered IDs. Whanau defends against clustering at- 
tacks using layers, illustrated in Figure 3. Each node 
uses a random walk to choose a random key as its layer-0 
ID. This ensures that honest nodes’ layer-0 IDs are distributed
evenly over the keys stored by the system.

To pick a layer-1 ID, each node picks a random entry
from its own layer-0 finger table and uses that node’s ID.
To pick a layer-2 ID, each node takes a random layer-1
finger’s ID, and so on for each of the ℓ = O(log km) layers.
In the end, each node is present at, and must collect
successor tables for, ℓ positions in the key space.

[Figure 3 diagram: two panels, Layer 0 and Layer 1. Annotations:
“Cluster attack on layer 1: layer 1 unbalanced, but layer 0 balanced”;
“Sybil ID cluster mirrored by layer-1 honest IDs”.]

Figure 3: Honest IDs (black dots) in layer 0 are uniformly
distributed over the set of keys (X axis), while Sybil IDs (red dots)
may cluster arbitrarily. Honest nodes choose their layer i + 1
IDs from the set of all layer i IDs (honest and Sybil). Thus,
most layers are balanced. Even if there is a clustering attack on
a key, it will always be easy to find an honest finger near the
key using a random sampling procedure.

Layers defend against key clustering and ID clustering
attacks. If the attacker inserts many keys near a target
key, this will simply cause more honest nodes to choose
layer-0 IDs in that range. The number of keys the attacker
can insert is limited by the number of attack edges.
Thus, a key clustering attack only shifts around the honest
nodes’ IDs without creating any hot or cold spots.

Nodes choose their own IDs; thus, if the attacker
chooses all its layer-0 IDs to fall immediately before a
target key, it might later be difficult to find an honest
finger near the key. However, if the adversary manages
to supply an honest node u with many clustered layer-0
IDs, then this increases the probability that u will pick
one of these clustered IDs as its own layer-1 ID. As a result,
the distribution of honest layer-1 IDs tends to mimic
any clusters in the Sybil layer-0 IDs. This increases the
honest nodes’ presence in the adversary-chosen range,
and increases the likelihood that layer-1 finger tables are
balanced between honest and Sybil nodes.

The same pattern of balance holds for layers 2 through
ℓ − 1. As long as most layers have a good ratio of honest to
Sybil IDs in every range, random sampling (as described
in Section 6.2) can find honest fingers near any target key.
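
The mirroring effect can be seen in a small simulation. The sketch below is an illustration under strong assumptions: keys are integers in [0, 10000), fingers are uniform samples over all layer-0 IDs rather than random-walk samples, and the Sybil share (half of all IDs) is deliberately exaggerated to make the effect visible. All names are hypothetical.

```python
import random

def layer1_ids(honest_l0, sybil_l0, r_f, rng):
    """Each honest node draws r_f fingers from all layer-0 IDs and adopts one.

    Assumption: fingers are uniform samples of honest and Sybil IDs,
    standing in for random walks that sometimes escape to Sybil nodes.
    """
    pool = honest_l0 + sybil_l0
    ids = []
    for _ in honest_l0:
        finger_ids = [rng.choice(pool) for _ in range(r_f)]
        ids.append(rng.choice(finger_ids))  # CHOOSE-ID for layer 1
    return ids

def count_in_range(ids, lo, hi):
    """Number of IDs x with lo <= x < hi."""
    return sum(lo <= x < hi for x in ids)

rng = random.Random(3)
honest_l0 = [rng.randrange(10_000) for _ in range(1000)]      # spread evenly
sybil_l0 = [rng.randrange(5000, 5100) for _ in range(1000)]   # clustered
honest_l1 = layer1_ids(honest_l0, sybil_l0, r_f=20, rng=rng)
```

Comparing `count_in_range(honest_l0, 5000, 5100)` with `count_in_range(honest_l1, 5000, 5100)` shows honest layer-1 IDs concentrating in the range the adversary targeted at layer 0.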


5.4 Churn 


There are three sources of churn that Whanau must han- 
dle. First, computers may become temporarily unavail- 
able due to network failures, mobility, overload, crashes, 
or being turned off daily. We call this node churn. Wha- 
nau builds substantial redundancy into its routing tables 
to handle Sybil attacks, and this same redundancy is suf- 
ficient to handle temporary node failures. Section 9.5 
shows that increasing node churn results in a modest ad- 
ditional overhead. 

The second source of churn is changes to the social 
relationships between participants. This social churn re- 
sults in adding or deleting social connections. A single 
deleted link doesn’t impact Whanau’s performance, as 
long as the graph remains fast mixing and neither end- 
point became malicious. (If one did become malicious, it 
would be treated as an attack edge.) However, Whanau 
doesn’t immediately react to social churn, and can only 
incorporate added links by rebuilding its routing tables. 
Nodes which leave the DHT entirely are not immediately 
replaced. Therefore, until SETUP is invoked, the routing 
tables, load distribution, and so on will slowly become 
less reflective of the current social network, and perfor- 
mance will slowly degrade. 

Social network churn occurs on a longer time scale
than node churn. For example, data from Mislove et al.
indicates that the Flickr social network’s half-life is approximately
6 weeks [18]. Running SETUP every day, or
every few minutes, would keep Whanau closely in sync
with the current social graph.

The final and most challenging source of churn is 
changes to the set of keys stored in the DHT. This key 
churn causes the distribution of keys to drift out of sync 
with the distribution of finger IDs. Reacting immediately 
to key additions and deletions can create “hot spots” in 
successor tables; this can only be repaired by re-running 
SETUP. Thus, in the worst case, newly stored keys will 
not become available until the tables are rebuilt. For 
some applications — like the IM example, in which each 
node only ever stores one key — this is not a problem as 
long as tables are refreshed daily. Other applications may 
have application-specific solutions. 

Unlike key churn, turnover of values does not present 
a challenge for Whanau: updates to the value associated 
with a key may always be immediately visible. For exam- 
ple, in the IM application, a public key’s current IP ad- 
dress can be changed at any time by the record’s owner. 
Value updates are not a problem because Whanau does 
not use the value fields when building its routing tables. 

Key churn presents a trade-off between the bandwidth 
consumed by rebuilding tables periodically and the delay 
from a key being inserted to the key becoming visible. 
This bandwidth usage is similar to stabilization in other 
DHTs; however, insecure DHTs can make inserted keys 
visible immediately, since they do not worry about clus- 
tering attacks. We hope to improve Whanau’s respon- 
siveness to key churn; we outline one idea in Section 10. 


6 The Whanau protocol 
This section defines SETUP and LOOKUP in detail. 


6.1 Setup 


The SETUP procedure takes each node’s social connections
and the local key-value records to store as inputs,
and constructs four routing tables:

• ids(u, i): u’s layer-i ID, a random key x.
• fingers(u, i): u’s layer-i fingers as (id, address) pairs.
• succ(u, i): u’s layer-i successor (key, value) records.
• db(u): a sample of records used to construct succ.

The global parameters r_f, r_s, r_d, and ℓ determine the
sizes of these tables; SETUP also takes an estimate of the
mixing time w as a parameter. Typically, nodes will have
a fixed bandwidth and storage budget to allocate amongst
the tables. Section 7 and Section 9 will show how varying
these parameters impacts Whanau’s performance.

The SETUP pseudocode (Figure 4) constructs the routing
tables in ℓ + 1 phases. The first phase sends out r_d random
walks to collect a sample of the records in the social
network and stores them in the db table. These samples
are used to build the other tables. The db table has the
good property that each honest node’s stored records are
frequently represented in other honest nodes’ db tables.

SETUP(stored-records(·), neighbors(·); w, r_d, r_f, r_s, ℓ)
    for each node u
        do db(u) ← SAMPLE-RECORDS(u, r_d)
    for i ← 0 to ℓ − 1
        do for each node u
            do ids(u, i) ← CHOOSE-ID(u, i)
               fingers(u, i) ← FINGERS(u, i, r_f)
               succ(u, i) ← SUCCESSORS(u, i, r_s)
    return fingers, succ

SAMPLE-RECORDS(u, r_d)
    for j ← 1 to r_d
        do v_j ← RANDOM-WALK(u)
           (key_j, value_j) ← SAMPLE-RECORD(v_j)
    return {(key_1, value_1), ..., (key_{r_d}, value_{r_d})}

SAMPLE-RECORD(u)
    (key, value) ← RANDOM-CHOICE(stored-records(u))
    return (key, value)

RANDOM-WALK(u_0)
    for j ← 1 to w
        do u_j ← RANDOM-CHOICE(neighbors(u_{j−1}))
    return u_w

Figure 4: SETUP procedure to build structured routing tables.


The remaining phases are used to construct the ℓ layers.
For each layer i, SETUP chooses each node’s IDs and
constructs its successor and finger tables. The layer-0 ID
is chosen by picking a random key from db. Higher-layer
IDs and finger tables are defined mutually recursively.
FINGERS(u, i, r_f) sends out r_f random walks and collects
the resulting nodes and i-th layered IDs into u’s i-th
layer finger table. For i > 0, CHOOSE-ID(u, i) chooses
u’s i-th layered ID by picking a random finger ID stored
in u’s (i − 1)-th finger table. As explained in Section 5.3,
this causes honest IDs to cluster wherever Sybil IDs have
clustered, ensuring a rough balance between good fingers
and bad fingers in any given range of keys.


Once a node has its ID for a layer, it must collect the 
successor list for that ID. It might seem that we could 
solve this the same way Chord does, by bootstrapping off 
LOOKUP to find the ID’s first successor node, then ask- 
ing it for its own successor list, and so on. However, as 
pointed out in Section 5.1, this recursive approach would 
enable the adversary to fill up the successor tables with 
bogus records over time. To avoid this, Whanau fills each 
node’s succ table without using any other node’s succ ta- 
ble; instead, it uses only the db tables. 


The information about any layered ID’s successors is
spread around the db tables of many other nodes, so
the SUCCESSORS subroutine must contact many nodes
and collect little bits of the successor list together. The
straightforward way to do this is to ask each node v for
the closest record in db(v) following the ID.

CHOOSE-ID(u, i)
    if i = 0
        then (key, value) ← RANDOM-CHOICE(db(u))
             return key
        else (x, f) ← RANDOM-CHOICE(fingers(u, i − 1))
             return x

FINGERS(u, i, r_f)
    for j ← 1 to r_f
        do v_j ← RANDOM-WALK(u)
           x_j ← ids(v_j, i)
    return {(x_1, v_1), ..., (x_{r_f}, v_{r_f})}

SUCCESSORS(u, i, r_s)
    for j ← 1 to r_s
        do v_j ← RANDOM-WALK(u)
           R_j ← SUCCESSORS-SAMPLE(v_j, ids(u, i))
    return R_1 ∪ ··· ∪ R_{r_s}

SUCCESSORS-SAMPLE(u, x_0)
    {(key_1, value_1), ..., (key_{r_d}, value_{r_d})} ← db(u)
        (sorted so that x_0 ⪯ key_1 ⪯ ··· ⪯ key_{r_d} ≺ x_0)
    return {(key_1, value_1), ..., (key_t, value_t)}   (for small t)

Figure 4 (continued): Each function’s first parameter is the node it executes on.


The SUCCESSORS subroutine calls
SUCCESSORS-SAMPLE r_s times, each time accumulating
a few more potential successors. SUCCESSORS-SAMPLE
works by contacting a random node and
sending it a query containing the ID. The queried node
v, if it is honest, sorts all of the records in its local db(v)
by key, and then returns the closest few records to the
requestor’s ID. The precise number t of records sampled
does not matter for correctness, so long as t is small
compared to r_d. Section 7’s analysis simply lets t = 1.


This successor sampling technique ensures that for appropriate
values of r_d and r_s, the union of the repeated
queries will contain all the desired successor records.
Section 7.1 will state this quantitatively, but the intuition
is as follows. Each SUCCESSORS-SAMPLE query is an
independent and random sample of the set of keys in the
system which are near the ID. There may be substantial
overlap in the result sets, but for sufficiently large r_s, we
will eventually receive all immediate successors. Some
of the records returned will be far away from the ID and
thus not really successors, but they will show up only
a few times. Likewise, bogus results returned by Sybil
nodes consume some storage space, but do not affect correctness,
since they do not prevent the true successors
from being found.

LOOKUP(u, key)
    v ← u
    repeat value ← TRY(v, key)
           v ← RANDOM-WALK(u)
    until TRY found valid value, or hit retry limit
    return value

TRY(u, key)
    {(x_1, f_1), ..., (x_{r_f}, f_{r_f})} ← fingers(u, 0)
        (sorted so key ⪯ x_1 ⪯ ··· ⪯ x_{r_f} ≺ key)
    j ← r_f
    repeat (f, i) ← CHOOSE-FINGER(u, x_j, key)
           value ← QUERY(f, i, key)
           j ← j − 1
    until QUERY found valid value, or hit retry limit
    return value

CHOOSE-FINGER(u, x_0, key)
    for i ← 0 to ℓ − 1
        do F_i ← {(x, f) ∈ fingers(u, i) | x_0 ⪯ x ⪯ key}
    i ← RANDOM-CHOICE({i ∈ {0, ..., ℓ − 1} | F_i non-empty})
    (x, f) ← RANDOM-CHOICE(F_i)
    return (f, i)

QUERY(u, i, key)
    if (key, value) ∈ succ(u, i) for some value
        then return value
        else error “not found”

Figure 5: LOOKUP procedure to retrieve a record by key.


In order to process requests quickly, each node should 
sort its finger tables by ID and its successor tables by key. 


6.2 Lookup 


The basic goal of the LOOKUP procedure is to find
a finger node which is honest and which has the target
record in its successor table. The SETUP procedure ensures
that any honest finger f which is “close enough” to
the target key y will have y ∈ succ(f). Since every finger
table contains many random honest nodes, each node
is likely to have an honest finger which is “close enough”
(if r_f is big enough). However, if the adversary clusters
IDs near the target key, then LOOKUP might have
to waste many queries to Sybil fingers before finding this
honest finger. LOOKUP’s pseudocode (Figure 5) chooses
fingers carefully to foil this category of attack.

To prevent the adversary from focusing its attack on 
a single node’s finger table, LOOKUP tries first using its 
own finger table, and, if that fails, repeatedly chooses a 
random delegate and retries the search from there. 

The TRY subroutine searches the finger table for the
closest layer-zero ID x_0 to the target key key. It then
chooses a random layer i to try, and a random finger f
whose ID in that layer lies between x_0 and the target key.
TRY then queries f for the target key.

If there is no clustering attack, then the layer-zero ID
x_0 is likely to be an honest ID; if there is a clustering
attack, that can only make x_0 become closer to the target
key. Therefore, in either case, any honest finger found
between x_0 and key will be close enough to have the
target record in its successor table.

Only one question remains: how likely is CHOOSE- 
FINGER to pick an honest finger versus a Sybil finger? 
Recall from Section 5.3 that, during SETUP, if the adversary
clustered his IDs in the range [x_0, key] in layer i,
then the honest nodes tended to cluster in the same range
in layer i + 1. Thus, the adversary’s fingers cannot dominate
the range in the majority of layers. Now, the layer
chosen by CHOOSE-FINGER is random — so, probably 
not dominated by the adversary — and therefore, a finger 
chosen from that layer is likely to be honest. 

In conclusion, for most honest nodes’ finger tables, 
CHOOSE-FINGER has a good probability of returning an 
honest finger which is close enough to have the target key 
in its successor table. Therefore, LOOKUP should almost 
always succeed after only a few calls to TRY. 
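
The TRY, CHOOSE-FINGER, and QUERY steps of Figure 5 can be sketched as below. The sketch assumes integer keys, represents `fingers_u` as a list of per-layer tables of (id, node) pairs, represents `succ` as a dictionary keyed by (node, layer), and uses the number of layer-0 fingers as the retry limit. It omits the outer LOOKUP loop that retries from random delegates. These representations are assumptions for illustration only.

```python
import random

def in_arc(x0, x, key):
    """True if x lies on the circular arc [x0, key] of the key ordering."""
    if x0 <= key:
        return x0 <= x <= key
    return x >= x0 or x <= key

def choose_finger(fingers_u, x0, key, rng):
    """Pick a random layer with a candidate in [x0, key], then a finger in it."""
    per_layer = {}
    for i, table in enumerate(fingers_u):
        candidates = [(x, f) for (x, f) in table if in_arc(x0, x, key)]
        if candidates:
            per_layer[i] = candidates
    i = rng.choice(sorted(per_layer))
    _, f = rng.choice(per_layer[i])
    return f, i

def query(succ, f, i, key):
    """Return the value if finger f holds key in its layer-i successor table."""
    return succ.get((f, i), {}).get(key)

def try_lookup(fingers_u, succ, key, rng):
    """One TRY: start at the closest layer-0 ID before key, then back off."""
    # Circular sort starting at key, so the closest predecessor comes last.
    layer0 = sorted(fingers_u[0], key=lambda e: (e[0] < key, e[0]))
    for x_j, _ in reversed(layer0):
        f, i = choose_finger(fingers_u, x_j, key, rng)
        value = query(succ, f, i, key)
        if value is not None:
            return value
    return None
```

Because the layer is drawn at random before the finger, an adversary that floods one layer with clustered IDs near the target key gains influence only over the attempts that happen to select that layer.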


7 Analysis of Whanau’s performance 


For the same reason as a flooding protocol, Whanau’s 
LOOKUP will always eventually succeed if it runs for 
long enough: some random walk (LOOKUP, line 3) will 
find the target node. However, the point of Whanau’s 
added complexity is to improve lookup performance be- 
yond a flooding algorithm. This section sketches the rea- 
soning why LOOKUP uses O(1) messages to find any tar- 
get key, leaving out most proofs; more detailed proofs 
can be found in an accompanying technical report [15]. 

To the definitions in Section 4, we will add a few more 
in order to set up our analysis. 


Definition (good sample probability). Let p be the probability
that a random walk starting from a winner node
returns a good sample (a random honest node). p decreases
with the number of attack edges g. Specifically,
we have previously shown that p ≥ (1/2)(1 − gw/(εn)) for any
ε [15]. We are interested in the case where g < εn/(2w) =
O(n/w). In this case, we have that p > 1/4, so a substantial
fraction of random walks return good samples.


Definition (“the database”). Let D be the disjoint union
of all the honest nodes’ db tables:

\[ D = \biguplus_{\text{honest } u} db(u) \]

Intuitively, we expect honest nodes’ records to be heavily
represented in D. D has exactly r_d m elements; we expect
at least (1 − ε) p r_d m of those to be from honest nodes.

Definition (distance metric d_xy). Recall from Section 5.3
that Whanau has no a priori notion of distance
between two keys. However, with the definition of D, we
can construct an a posteriori distance metric.

Let D_xy = {z ∈ D | x ⪯ z ≺ y} be all the records
(honest and Sybil) in D on the arc [x, y). Then define

\[ d_{xy} = \frac{|D_{xy}|}{|D|} = \frac{|D_{xy}|}{r_d\, m} \in (0, 1) \]

Note that d_xy is not used (or indeed, observable) by the
protocol; we use it only in the analysis.
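
As a small worked example of this a posteriori metric, the sketch below computes d_xy for a toy database. It assumes integer keys and treats D as a plain list (a multiset) of sampled keys; the function names are hypothetical.

```python
def arc_count(keys, x, y):
    """Number of keys z on the circular arc [x, y), with wrap-around."""
    if x < y:
        return sum(x <= z < y for z in keys)
    return sum(z >= x or z < y for z in keys)

def distance(db_keys, x, y):
    """d_xy = |D_xy| / |D| for the multiset db_keys of sampled keys."""
    return arc_count(db_keys, x, y) / len(db_keys)

D = [1, 3, 5, 7, 9]
d_forward = distance(D, 3, 7)  # keys 3 and 5 lie on [3, 7)
d_wrapped = distance(D, 7, 3)  # keys 7, 9 and 1 lie on [7, 3)
```

The two arcs partition the database, so the two distances sum to 1. The metric depends on which keys happen to be stored, which is why it is only an analysis tool.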


7.1 Winner successor tables are correct 


Recall that SETUP (Figure 4) uses the SUCCESSORS subroutine,
which calls SUCCESSORS-SAMPLE r_s times, to
find all the honest records in D immediately following
an ID x. Consider an arbitrary successor key y ∈ D. If
r_d and r_s are sufficiently large, and d_xy is sufficiently
small, then y will almost certainly be returned by some
call to SUCCESSORS-SAMPLE. Thus, any winner node
u’s table succ(u, i) will ultimately contain all records y
close enough to the ID x = ids(u, i).


Lemma. Call SUCCESSORS-SAMPLE(x) r_s times. We
then have (for r_d, 1/d_xy ≫ 1 and r_s ≪ n) a Prob[fail] of:

\[ \Pr[\, y \notin succ(u, i) \,] \lesssim \exp\!\left( -\frac{(1-\varepsilon)\, p^{2}\, r_s r_d}{km}\, e^{-r_d d_{xy}} \right) \]

Under the simplifying assumption r_d < 1/d_xy ≪ km:

\[ \Pr[\, y \notin succ(u, i) \,] \lesssim \exp\!\left( -\frac{(1-\varepsilon)\, p^{2}\, r_s r_d}{e \cdot km} \right) \qquad (1) \]

We can intuitively interpret this result as follows: to
get a complete successor table with high probability,
we need r_s r_d = Ω(km log km). This is related to the
Coupon Collector’s Problem: the SUCCESSORS subroutine
examines r_s r_d random elements from D, and it must
examine the entire set of km honest records.
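
The Coupon Collector connection can be made concrete with the standard result that collecting all N equally likely items takes N·H_N uniform draws in expectation, where H_N is the N-th harmonic number. The sketch below evaluates this for an illustrative km = 10,000; it ignores Sybil records and the factors of p and ε, so it gives only the order of magnitude.

```python
import math

def expected_draws(n_items):
    """Expected uniform draws needed to see all n_items at least once: n * H_n."""
    return n_items * sum(1.0 / j for j in range(1, n_items + 1))

km = 10_000
draws = expected_draws(km)   # roughly km * (ln km + 0.577)
ratio = draws / (km * math.log(km))
```

The ratio stays close to 1, which is the source of the logarithmic factor in the requirement on the number of sampled records.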


7.2 Layer zero IDs are evenly distributed 


Consider an arbitrary winner u’s layer-zero finger table
F_0 = fingers(u, 0): approximately p r_f of the nodes in
F_0 will be random honest nodes. Picking a random honest
node f ∈ F_0 and then picking a random key from
db(f) is the same as picking a random key from D. Thus,
p r_f of the IDs in F_0 are random keys from D. For any
keys x, y ∈ D, the probability that a random honest finger’s
layer-zero ID falls in the range [x, y) is simply d_xy.

Lemma. With r_f fingers, we have a Prob[fail] of:

\[ \Pr[\, \text{no layer-0 finger in } [x, y) \,] \lesssim (1 - d_{xy})^{p\, r_f} \qquad (2) \]

We expect to find approximately p r_f d_xy of these honest
fingers with IDs in the range [x, y).

We can intuitively interpret this result as follows: to
see Ω(1) fingers in [x, y) with high probability, we need
r_f = Ω(log m / d_xy). In other words, large finger tables
enable nodes to find a layer-0 finger in any small range of
keys. Thus layer-0 finger tables tend to cover D evenly.
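
A short numeric check of this bound, assuming it has the form (1 − d_xy)^(p·r_f) with p·r_f honest fingers each landing in [x, y) independently with probability d_xy. The function names and parameter values are illustrative.

```python
import math

def no_finger_probability(d_xy, p, r_f):
    """Bound (2): chance that no honest layer-0 finger lands in [x, y)."""
    return (1.0 - d_xy) ** (p * r_f)

def fingers_needed(d_xy, p, target_failure):
    """Smallest r_f for which the bound drops to target_failure or below."""
    return math.ceil(math.log(target_failure) / (p * math.log(1.0 - d_xy)))

# When d_xy equals the average gap 1 / (p * r_f), the bound is about 1/e.
at_average_gap = no_finger_probability(0.001, 0.5, 2000)
r_f_for_one_in_a_million = fingers_needed(0.001, 0.5, 1e-6)
```

The second value illustrates the logarithmic dependence: driving the failure probability down by many orders of magnitude multiplies r_f by only a modest factor.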


7.3 Layers are immune to clustering


The adversary may attack the finger tables by clustering
its IDs. CHOOSE-ID line 4 causes honest nodes to respond
by clustering their IDs on the same keys.

Pick any keys x, y ∈ D sufficiently far apart that we
expect at least one layer-zero finger ID in [x, y) with
high probability (as explained above). Let β_i (“bad fingers”)
be the average (over winner nodes’ finger tables)
of the number of Sybil fingers with layer-i IDs in [x, y).
Likewise, let γ_i (“good fingers”) be the average number
of winner fingers in [x, y). Define μ = (1 − ε)p.


Lemma. The number of good fingers in [x, y) is proportional
to the total number of fingers in the previous layer:

\[ \gamma_{i+1} \gtrsim \mu\,(\gamma_i + \beta_i) \]

Corollary. Let the density ρ_i of winner fingers in layer i
be defined as ρ_i = γ_i / (γ_i + β_i). Then

\[ \prod_{i=0}^{\ell-1} \rho_i \gtrsim \frac{\mu^{\ell}}{(1-\mu)\, r_f} \]

Because the density of winner fingers ρ_i is bounded
below, this result means that the adversary’s scope to affect
ρ_i is limited. The adversary may strategically choose
any values of β_i between zero and (1 − μ) r_f. However,
the adversary’s strategy is limited by the fact that if it
halves the density of good nodes in one layer, the density
of good nodes in another layer will necessarily double.

Theorem. The average layer’s density of winner fingers
is at least

\[ \bar{\rho} = \frac{1}{\ell} \sum_{i=0}^{\ell-1} \rho_i \gtrsim \frac{\mu}{e} \left[ (1-\mu)\,\mu\, r_f \right]^{-1/\ell} \]

Observe that as ℓ → 1, the average layer’s density of
good fingers shrinks exponentially to O(1/r_f), and that
as ℓ → ∞, the density of good fingers asymptotically
approaches the limit μ/e. We can get ρ̄ within a factor of
e of this ideal bound by setting the number of layers ℓ to

\[ \ell = \log\left[ (1-\mu)\,\mu\, r_f \right] \qquad (3) \]

For most values of μ ∈ [0, 1], ℓ ≈ log r_f. However, when
μ approaches 1 (no attack) or 0 (strong attack), ℓ → 1.
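
The effect of the layer count on the density bound can be tabulated numerically. The sketch assumes the bound has the form (μ/e)·[(1 − μ)·μ·r_f]^(−1/ℓ), takes the logarithm in (3) to be natural (the choice that makes the "factor of e" statement exact), and rounds ℓ up to at least one layer. Names and parameter values are illustrative.

```python
import math

def num_layers(mu, r_f):
    """Equation (3) with a natural logarithm (assumed), rounded up, minimum 1."""
    x = (1.0 - mu) * mu * r_f
    return max(1, math.ceil(math.log(x))) if x > 1.0 else 1

def density_bound(mu, r_f, layers):
    """Lower bound on the average density of winner fingers across layers."""
    x = (1.0 - mu) * mu * r_f
    return (mu / math.e) * x ** (-1.0 / layers)

mu, r_f = 0.5, 1000
one_layer = density_bound(mu, r_f, 1)                      # O(1 / r_f)
chosen = density_bound(mu, r_f, num_layers(mu, r_f))       # about mu / e^2
limit = mu / math.e                                        # as layers grow
```

With a single layer the bound is tiny; with the logarithmic number of layers it is already within a small constant factor of the limit, so adding further layers buys little.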


7.4 Main result: lookup is fast 


The preceding sections’ tools enable us to prove that 
Whanau uses a constant number of messages per lookup. 


Theorem (Main theorem). Define κ = km·e / ((1 − ε) p^3).
Suppose that we pick r_s, r_f, r_d, and ℓ so that (3) and (4)
are satisfied, and run SETUP to build routing tables.

\[ \frac{r_s r_d}{p} \ge \kappa \quad\text{and}\quad r_s r_f \ge \kappa \qquad (4) \]

Now run LOOKUP on any valid key y. Then, a single
iteration of TRY succeeds with probability better than
Prob[success] > (1/20)(1 − ε)p = Ω(1).

The value κ is the aggregate storage capacity km of
the DHT times an overhead factor e/((1 − ε) p^3) which represents
the extra work required to protect against Sybil
attacks. When g < εn/(2w), this overhead factor is O(1).

The formula (4) may be interpreted to mean that
both r_s r_d and r_s r_f must be Ω(κ): the first so that
SUCCESSORS-SAMPLE is called enough times to collect
every successor, and the second so that successor lists are
longer than the distance between fingers. These would
both need to be true even with no adversary.


Proof sketch. Let x ∈ D be a key whose distance to the
target key y is d_xy = 1/(p r_f), the average distance between
honest fingers.

First, substitute the chosen d_xy into (2). By the lemma,
the probability that there is an honest finger x_h ∈ [x, y)
is at least 1 − 1/e. TRY line 1 finds x_{r_f}, the closest
layer-zero finger to the target key, and TRY passes it to
CHOOSE-FINGER as x_0. x_0 may be an honest finger or a
Sybil finger, but in either case, it must be at least as close
to the target key as x_h. Thus, x_0 ∈ [x, y) with probability
at least 1 − 1/e.

Second, recall that CHOOSE-FINGER first chooses a
random layer, and then a random finger f from that layer
with ID x_f ∈ [x_0, y]. The probability of choosing any
given layer i is 1/ℓ, and the probability of getting an honest
finger from the range is ρ_i from Section 7.3. Thus, the
total probability that CHOOSE-FINGER returns an honest
finger is simply the average layer’s density of good
nodes (1/ℓ) Σ_i ρ_i = ρ̄. Since we assumed (3) was satisfied,
Section 7.3 showed that the probability of success is at
least ρ̄ ≳ (1 − ε)p/e^2.

Finally, if the chosen finger f is honest, the only question
remaining is whether the target key is in f’s successor
table. Substituting d_{x_f y} < d_xy and (4) into (1)
yields Prob[y ∈ succ(f)] ≳ 1 − 1/e. Therefore, when
QUERY(f, y) checks f’s successor table, it succeeds
with probability at least 1 − 1/e.

A TRY iteration will succeed if three conditions hold:
(1) x_0 ∈ [x, y); (2) CHOOSE-FINGER returns a winning
finger f; (3) y ∈ succ(f). Combining the probabilities
calculated above for each of these events yields the total
success probability

\[ \left(1 - \frac{1}{e}\right) \frac{(1-\varepsilon)\,p}{e^{2}} \left(1 - \frac{1}{e}\right) > \frac{1}{20}\,(1-\varepsilon)\,p. \qquad \square \]

Corollary. The expected number of queries sent by
LOOKUP is bounded by 20/((1 − ε)p) = O(1). With high probability,
the maximum number of queries is O(log n).
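
The constant 1/20 can be verified numerically. The sketch multiplies the three per-step probabilities from the proof sketch and compares the product with (1/20)(1 − ε)p; the function names and sample values of ε and p are illustrative.

```python
import math

def try_success_bound(eps, p):
    """Product of the three success probabilities from the proof sketch."""
    locate = 1.0 - 1.0 / math.e             # x_0 falls in [x, y)
    honest = (1.0 - eps) * p / math.e ** 2  # CHOOSE-FINGER returns a winner
    stored = 1.0 - 1.0 / math.e             # y is in the finger's successor table
    return locate * honest * stored

def expected_queries_bound(eps, p):
    """Bound 20 / ((1 - eps) * p) on the expected number of queries."""
    return 20.0 / ((1.0 - eps) * p)

constant = try_success_bound(0.0, 1.0)  # (1 - 1/e)^2 / e^2, about 0.054
```

Since each TRY iteration succeeds with probability above (1/20)(1 − ε)p, the number of iterations is dominated by a geometric random variable with that parameter, which gives the stated bound on the expected number of queries.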





7.5 Routing tables are small 


Each (virtual) node has S = r_d + ℓ(r_f + r_s) table entries
in total. To minimize S subject to (4), set r_s = r_f = √κ
and r_d = p√κ. Therefore, the optimal total table size is
S ≈ √κ log κ, so S = O(√(km) log km), as expected.

As the number of attack edges g increases, the required
table size grows as (1 − ε)^(−1/2) p^(−3/2). A good approximation
for this security overhead factor is 1 + 2(gw/n) + 6(gw/n)^2
when g < n/(6w). Thus, overhead grows linearly with g.

As one might expect for a one-hop DHT, the optimum 
finger tables and the successor tables are the same size. 
The logarithmic factor in the total table size comes from 
the need to maintain O(log km) layers to protect against 
clustering attacks. If the number of attack edges is small, 
(3) indicates that multiple layers are unnecessary. This is 
consistent with the experimental data in Section 9.3. 


8 Implementation 


We have implemented Whanau in a simulator and on 
PlanetLab. To simulate very large networks — some of 
the social graphs we use have millions of nodes — we 
wrote our own simulator. Existing peer-to-peer simula- 
tors don’t scale to such a large number of nodes, and our 
simulator uses many Whanau-specific optimizations to 
reduce memory consumption and running time. The sim- 
ulator directly implements the protocol as described in 
Figures 4 and 5, takes a static social network as input, 
and provides knobs to experiment with Whanau’s dif- 
ferent parameters. The simulator does not simulate real- 
world network latencies and bandwidths, but only counts 
the number of messages that Whanau sends. The primary 
purpose of the simulator is to validate the correctness and 
scaling properties of Whanau with large social networks. 

We also implemented Whanau and the IM applica- 
tion in Python on PlanetLab. This implementation runs 
a message-passing protocol to compute SETUP and uses 
RPC to implement LOOKUP. When a user starts a node, 
the user provides the keys and current IP addresses that 
identify their social neighbor nodes. The IM client stores 
its current IP address into the DHT. When a user wants 
to send an IM to another user, the IM client looks up the 
target user’s contact information in the DHT and authen- 
ticates the returned record using the key. If the record 
is authentic, the IM application sends the IM to the IP 
address in the record. Whanau periodically rebuilds its 
tables to incorporate nodes which join and leave. 

The average latency for a lookup is usually one round- 
trip on PlanetLab. Using locality-aware routing, Whanau 
could achieve lower than one network round-trip on av- 
erage, but we haven’t implemented this feature yet. 

Our PlanetLab experiments were limited by the num- 
ber of PlanetLab nodes available and their resources: we 
were able to run up to 4000 Whanau nodes simultane- 
ously. Unfortunately, at scales smaller than this, Whanau 
nearly reduces to simple broadcast. Given this practical 
limitation, it was difficult to produce insightful scaling 
results on PlanetLab. Furthermore, although results were 
broadly consistent at small scales, we could not cross- 
validate the simulator at larger scales. The PlanetLab ex- 
periments primarily demonstrated that Whanau works on 
a real network with churn, varying delays, and so on. 


USENIX Association 


              n = #nodes    m = #edges    avg. degree 
Flickr         1,624,992    15,476,835       9.52 
LiveJournal    5,189,809    48,688,097       9.38 
YouTube        1,134,890     2,987,624       2.63 
DBLP             511,163     1,871,070       3.66 
Table 2: Properties of the input data sets. 


9 Results 


This section experimentally verifies several hypotheses: 
(1) real-world social networks exhibit the properties that 
Whanau relies upon; (2) Whanau can handle clustering 
attacks (tested by measuring its performance versus ta- 
ble size and the number of attack edges); (3) layered IDs 
are essential for handling clustering attacks; (4) Whanau 
achieves the same scalability as insecure one-hop DHTs; 
and (5) Whanau can handle node churn on PlanetLab. 

Our Sybil attack model permits the adversary to cre- 
ate an unlimited number of pseudonyms. Since previous 
DHTs cannot tolerate this attack at all, this section does 
not compare Whanau’s Sybil-resistance against previous 
DHTs. However, in the non-adversarial case, the exper- 
iments do show that Whanau scales like any other inse- 
cure one-hop DHT, so (ignoring constant factors such as 
cryptographic overhead) adding security is “free”. Also, 
similarly to other (non-locality-aware) one-hop DHTs, 
the lookup latency is one network round-trip. 


9.1 Real-world social nets fit assumptions 


Nodes in the Whanau protocol bootstrap from a social 
network to build their routing tables. It is important 
for Whanau that the social network is fast mixing: that 
is, a short random walk starting from any node should 
quickly approach the stationary distribution, so that there 
is roughly an equal probability of ending up at any edge 
(virtual node). We test if this fast-mixing property holds 
for social network graphs, extracted from Flickr, Live- 
Journal, YouTube, and DBLP, which have also been used 
in other studies [18, 26]. These networks correspond to 
real-world users and their social connections. The Live- 
Journal graph was estimated to cover 95.4% of the users 
in Dec 2006, and the Flickr graph 26.9% in Jan 2007. 
We preprocessed the input graphs by discarding un- 
connected nodes and transforming directed edges into 
undirected edges. (The majority of links were already 
symmetric.) The resulting graphs’ basic properties are 
shown in Table 2. The node degrees follow power law 
distributions, with coefficients between 1.6 and 2 [18]. 
To test the fast-mixing property, we sample the distribution 
of random walks as follows. We pick a random 
starting edge i, and for each ending edge j, compute the 
probability p_ij that a walk of length w ends at j. Computing 
p_ij for all m possible starting edges i is too time-intensive, 
so we sampled 100 random starting edges i and 
computed p_ij for all m end edges j. For a fast-mixing 

Figure 6: Mixing properties of social graphs. Each line shows 
a CDF of the probability that a w-step random walk ends on a 
particular edge. The X axis is normalized so that the mean is 1. 


network, we expect the probability of ending up at a par- 
ticular edge to approach 1/m as w increases to O(log n). 

Figure 6 plots the CDF of p_ij for increasing values of 
w. To compare the different social graphs we normalize 
the CDFs so that they have the same mean. Thus, for all 
graphs, p_ij = 1/m corresponds to the ideal line at 1. As 
expected, as the number of steps increases to 80, each 
CDF approaches the ideal uniform distribution. 
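
The mixing test can be illustrated on a toy graph. The sketch below is an assumption-laden simplification: it tracks walks over nodes rather than edges (ending at a uniformly random edge corresponds to ending at a node with probability proportional to its degree), and the four-node graph `TOY` is made up for illustration rather than taken from the data sets.

```python
from fractions import Fraction

def walk_distribution(adj, start, w):
    """Exact distribution over end nodes of a w-step random walk from start."""
    dist = {start: Fraction(1)}
    for _ in range(w):
        nxt = {}
        for node, p in dist.items():
            share = p / len(adj[node])      # move to each neighbor equally
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, Fraction(0)) + share
        dist = nxt
    return dist

def stationary(adj):
    """Stationary distribution of the walk: proportional to node degree."""
    total = sum(len(nbs) for nbs in adj.values())
    return {v: Fraction(len(nbs), total) for v, nbs in adj.items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3.
TOY = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
for w in (2, 5, 20):
    gap = total_variation(walk_distribution(TOY, 0, w), stationary(TOY))
    print(w, float(gap))
```

The printed gap shrinks as w grows, which is the behavior the CDFs in Figure 6 show at much larger scale.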


The CDFs at w = 10 are far from the ideal distribu- 
tion, but there are two reasons to prefer smaller values 
of w. First, the amount of bandwidth consumed scales as 
w. Second, larger values of w increase the chance that 
a random walk will return a Sybil node. Section 9.2 will 
show that Whanau works well even when the distribution 
of random walks is not perfect. 

Recall from Section 4.2 that when a fast-mixing social 
network has a sparse cut between the honest nodes and 
Sybil nodes, random walks are a powerful tool to pro- 
tect against Sybil attacks. To confirm that this approach 
works with real-world social networks, we measured the 
probability that a random walk escapes the honest region 
of the Flickr network with different numbers of attack 
edges. To generate an instance with g attack edges, we 
marked random nodes as Sybils until there were at least 
g edges between marked nodes and non-marked nodes, 
and then removed any honest nodes which were con- 
nected only to Sybil nodes. For example, for the Flickr 
graph, in the instance with g = 1,940,689, there are 
n = 1,442,120 honest nodes (with m = 13,385,439 
honest edges) and 182,872 Sybil nodes. Since increasing 
the number of attack edges this way actually consumes 
honest nodes, it is not possible to test the protocol against 
g/n ratios substantially greater than 1. 


Figure 7: Escape probability on the Flickr network (X axis: walk length). 
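
The instance-generation procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' simulator code; the function name, the seed parameter, and the single pruning pass are assumptions.

```python
import random

def make_attack_instance(adj, g, seed=0):
    """Mark random nodes as Sybils until at least g attack edges exist.

    adj maps each node to the set of its neighbors (simple undirected graph).
    Returns (honest, sybils, attack_edges) after pruning stranded honest nodes.
    """
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    sybils, count = set(), 0
    for node in order:
        if count >= g:
            break
        for nb in adj[node]:
            # An edge to an existing Sybil stops being an attack edge;
            # an edge to an unmarked node becomes one.
            count += -1 if nb in sybils else 1
        sybils.add(node)
    # Remove honest nodes whose every neighbor is a Sybil.
    honest = {u for u in adj
              if u not in sybils and any(v not in sybils for v in adj[u])}
    attack_edges = sum(1 for u in honest for v in adj[u] if v in sybils)
    return honest, sybils, attack_edges
```

Note that pruning can push the final attack-edge count below g, and that every marked node shrinks the honest region, which is why large g/n ratios cannot be reached this way.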

Figure 7 plots the probability that a random walk start- 
ing from a random honest node will cross an attack edge. 
As expected, this escape probability increases with the 
number of steps and with the number of attack edges. 
When the number of attack edges is greater than the num- 
ber of honest nodes, the adversary has convinced essen- 
tially all of the system’s users to form links to its Sybil 
identities. In this case, long walks almost surely escape 
from the honest region; however, short walks still have 
substantial probability of reaching an honest node. For 
example, if the adversary controls 2 million attack edges 
on the Flickr network, then each user has an average of 
1.35 links to the adversary, and random walks of length 
40 are 90% Sybils. On the other hand, random walks of 
length 10 will return 60% honest nodes, although those 
honest nodes will be less uniformly distributed than a 
longer random walk. 
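
The escape probability plotted in Figure 7 can be computed exactly on a small graph. The sketch below assumes that a walk is lost as soon as it steps onto a Sybil node (an absorbing model); the path graph in the usage line is hypothetical.

```python
def escape_probability(adj, sybils, start, w):
    """Probability that a w-step walk from an honest start reaches a Sybil."""
    alive = {start: 1.0}    # probability mass still inside the honest region
    escaped = 0.0
    for _ in range(w):
        nxt = {}
        for node, p in alive.items():
            share = p / len(adj[node])
            for nb in adj[node]:
                if nb in sybils:
                    escaped += share            # walk crossed an attack edge
                else:
                    nxt[nb] = nxt.get(nb, 0.0) + share
        alive = nxt
    return escaped

# Hypothetical path 0-1-2 of honest nodes with one Sybil (3) attached to node 2.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print([escape_probability(PATH, {3}, 0, w) for w in (2, 3, 10)])
```

Longer walks and more attack edges both raise the result, matching the trend the figure reports.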


9.2 Performance under clustering attack 


To evaluate Whanau’s resistance against the Sybil attack, 
we ran instances of the protocol using a range of ta- 
ble sizes, number of layers, and adversary strengths. For 
each instance, we chose random honest starting nodes 
and measured the number of messages used by LOOKUP 
to find randomly chosen target keys. Our analysis pre- 
dicted that the number of messages would be O(1) as 
long as g < n/w. Since we used a fixed w = 10, the 
number of messages should be small when the number 
of attack edges is less than 10% of the number of hon- 
est nodes. We also expected that increasing the table size 
would reduce the number of messages. 

Our simulated adversary employs a clustering attack 
on the honest nodes’ finger tables, choosing all of its IDs 
to immediately precede the target key. In a real-world de- 
ployment of Whanau, it is only possible for an adversary 
to target a small fraction of honest keys in this way: to increase 
the number of Sybil IDs near a particular key, the 
adversary must move some Sybil IDs away from other 
keys. However, in our simulator, we allowed the adver- 
sary to change its IDs between every LOOKUP operation: 
that is, it can start over from scratch and adapt its attack 
to the chosen target key. Our results therefore show Whanau’s 
worst-case performance, and not the average-case 
performance for random target keys. 


Figure 8: Number of messages used by LOOKUP decreases as 
table size increases (Flickr social network). 

Figure 8 plots the number of messages required by 
LOOKUP versus table size. Since our policy is that re- 
sources scale with node degree (Section 3.3), we mea- 
sure table size in number of entries per social link. Each 
table entry contains a key and a node’s address (finger 
tables) or a key-value pair (successor and db tables). 

As expected, the number of messages decreases with 
table size and increases with the adversary’s power. For 
example, on the Flickr network and with a table size of 
10,000 entries per link, the median LOOKUP required 2 
messages when the number of attack edges is 20,000, but 
required 20 messages when there are 2,000,000 attack 
edges. The minimum resource budget for fast lookups is 
1,000 ≈ √n: below this table size, LOOKUP messages 
increased rapidly even without any attack. Under a massive 
attack (g > n) LOOKUP could still route quickly, 
but it required a larger resource budget of > 10,000. 

Figure 9 shows the full data set of which Figure 8 is a 
slice. Figure 9(a) shows the number of messages required 
for 100% of our test lookups to succeed. Of course, most 
lookups succeeded with far fewer messages than this up- 
per bound. Figure 9(b) shows the number of messages re- 
quired for 50% of lookups to succeed. The contour lines 
for maximum messages are necessarily noisier than for 
median messages, because the lines can easily be shifted 
by the random outcome of a single trial. The median is a 
better guideline to Whanau’s expected performance: for 
a table size of 5,000 on the Flickr graph, most lookups 
will succeed within 1 or 2 messages, but a few outliers 
may require 50 to 100 messages. 

We normalized the X-axis of each plot by the number 
of honest nodes in each network so that the results from 
different datasets could be compared directly. Our the- 
oretical analysis predicted that Whanau’s performance 
would drop sharply (LOOKUP messages would grow ex- 
ponentially) when g > n/10. However, we observed 
that, for all datasets, this transition occurs in the higher 
range m/10 < g < m. In other words, the analytic pre- 
diction was a bit too pessimistic: Whanau functions well 
until a substantial fraction of all edges are attack edges. 

When the number of attack edges g was below n/10, 
we observed that performance was more a function of table 
size, which must always be at least Ω(√m) for Whanau 
to function, than of g. Thus, Whanau’s performance 
is insensitive to relatively small numbers of attack edges. 


(a) Maximum messages required for every lookup to succeed. 
(b) Median messages required for lookups to succeed. 


Figure 9: Heat map and contours of the number of messages used by LOOKUP, versus attacker strength and table size. In the light 
regions at upper left, where there are few attack edges and a large resource budget, LOOKUP succeeded using only one message. In 
the dark regions at lower right, where there are many attack edges and a small resource budget, LOOKUP needed more than the retry 
limit of 120 messages. Wedges indicate where g = m/w and g = m; when g >> m/w, LOOKUP performance degrades rapidly. 
The plots’ right edges do not line up because it was not always possible to create an adversary instance with g = 10n. 


9.3 Layers vs. clustering attacks 


Section 9.2 showed that Whanau handles clustering at- 
tacks. For the plots in Figure 9, we simulated sev- 
eral different numbers of layers and chose the best- 
performing value for a given table size. This section eval- 
uates whether layers are important for Whanau’s attack 
resistance, and investigates how the number of layers 
should be chosen. 


Are layers important? We ran the same experiment as 
in Section 9.2, but we held the total table size at a constant 
100,000 entries per link. We varied whether the protocol 
spent those resources on more layers, or on bigger 
per-layer routing tables, and measured the median number 
of messages required by LOOKUP. 


We would expect that for small-scale attacks, one layer 
is best, because layers come at the cost of smaller per- 
layer tables. For more large-scale attacks, more layers is 
better, because layers protect against clustering attacks. 
Even for large-scale attacks, adding more layers yields 
quickly diminishing returns, and so we only simulated 
numbers of layers between 1 and 10. 


The solid lines in Figure 10 show the results for 
the clustering attack described in Section 9.2. When the 
number of attack edges is small, the best performance 
would be achieved by spending all resources on bigger 
routing tables, mostly avoiding layers. For Flickr, layers 
become important when the number of attack edges exceeds 
5,000 (0.3% of n); for g > 20,000, a constant 
number of layers (around 8) would yield the best performance. 
At high attack ratios (around g/n > 1), the data 
becomes noisy because performance degrades regardless 
of the choice of layers. 


Figure 10: Optimal layers versus attacker power. The resource 
budget was fixed at 100K table entries per link. 


The dashed lines in Figure 10 show the same simu- 
lation, but pitted against a naive attack: the adversary 
swallows all random walks and returns bogus replies to 
all requests, but does not cluster its IDs. This control 
data clearly shows that multiple layers are only help- 
ful against a clustering attack. The trends are clearer for 
the larger graphs (Flickr and LiveJournal) than for the 
smaller graphs (YouTube and DBLP). 100,000 table en- 
tries is very large in comparison to the smaller graphs’ 
sizes, and therefore the differences in performance be- 
tween small numbers of layers are not as substantial. 


How many layers should nodes use? The above data 
showed that layers improve Whanau’s resistance against 
powerful attacks but are not helpful when the DHT is 
not under attack. However, we cannot presume that nodes 
know the number of attack edges g, so the number of lay- 
ers must be chosen in some other way. Since layers cost 
resources, we would expect the optimal number of layers 
to depend on the node’s resource budget. If the number 
of table entries is large compared to √m, then increasing 
the number of layers is the best way to protect against 
powerful adversaries. On the other hand, if the number of 
table entries is relatively small, then no number of layers 
will protect against a powerful attack; thus, nodes should 
use a smaller number of layers to reduce overhead. 


We tested this hypothesis by re-analyzing the data col- 
lected for Section 9.2. For a given table size, we com- 
puted the number of layers that yielded optimal perfor- 
mance over a range of attack strengths. The results are 
shown in Figure 11. The overall trend is clear: at small 
table sizes, fewer layers are preferred, and at large table 
sizes, more layers are better. 

Figure 11: Optimal layers versus resource budget. Each point 
is a table size / attacker power instance. Larger points corre- 
spond to multiple instances. The trend line passes through the 
median point for each table size. 


The optimal number of layers is thus a function of 
the social network size and the resource budget, and we 
presume that honest nodes know both of these values at 
least approximately. Since Whanau’s performance is not 
very sensitive to small changes in the number of layers, a 
rough estimate is sufficient to get good performance over 
a wide range of situations. 


9.4 Whanau’s scalability 


Whanau is designed as a one-hop DHT. We collected 
simulated data to confirm that Whanau’s performance 
scales asymptotically the same as an insecure one-hop 
DHT such as Kelips [11]. Since we don’t have access 
to a wide range of social network datasets of different 
sizes, we generated synthetic social networks with vary- 
ing numbers of nodes using the standard technique of 
preferential attachment [1], yielding power-law degree 
distributions with exponents close to 2. For each net- 
work, we simulated Whanau’s performance for various 
table sizes and layers, as in the preceding sections. Since 
our goal was to demonstrate that Whanau reduces to a 
standard one-hop DHT in the non-adversarial case, we 
did not simulate any adversary. 
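
As a rough illustration of generating synthetic social graphs, the sketch below implements the classic Barabási-Albert form of preferential attachment. This is an assumption: the classic form yields a degree exponent near 3, so the generator used for the experiments presumably differs in its details, and the parameter `links_per_node` is hypothetical.

```python
import random

def preferential_attachment(n, links_per_node=2, seed=0):
    """Grow an undirected graph where new nodes prefer high-degree targets."""
    rng = random.Random(seed)
    m0 = links_per_node + 1                 # size of the initial clique
    adj = {v: set() for v in range(n)}
    endpoints = []                          # each node listed once per incident edge
    for u in range(m0):
        for v in range(u + 1, m0):
            adj[u].add(v)
            adj[v].add(u)
            endpoints += [u, v]
    for new in range(m0, n):
        targets = set()
        while len(targets) < links_per_node:
            # Sampling from the endpoint list picks nodes in proportion to degree.
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints += [new, t]
    return adj

graph = preferential_attachment(200)
print(max(len(nbs) for nbs in graph.values()))
```

Varying n while keeping the other parameters fixed gives a family of graphs of different sizes with a similar degree profile, which is what a scaling experiment needs.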

Figure 12 plots the median number of LOOKUP mes- 
sages versus table size and social network size. For a one- 
hop DHT, we expect that, holding the number of mes- 
sages to a constant O(1), the required table size scales 
as O(√m): the blue line shows this predicted trend. 
The heat map and its contours (black lines) show simulated 
results for our synthetic networks. For example, 
for m = 10,000,000, the majority of lookups succeeded 
using 1 or 2 messages for a table size of ≈ 2,000 entries 
per link. The square and triangle markers plot our four 
real-world datasets alongside the synthetic networks for 
comparison. While each real network has idiosyncratic 
features of its own, it is clear that the table sizes follow 
the O(√m) scaling trend we expect of a one-hop DHT. 


Figure 12: Number of messages used by LOOKUP, versus sys- 
tem size and table size. The heat map and contour lines show 
data from synthetic networks, while the markers show that real- 
world social networks fall roughly onto the same contours. 
Whanau scales like a one-hop DHT. 


9.5 PlanetLab and node churn 


Whanau’s example IM application runs on PlanetLab. 
We performed an experiment in which we started 4000 
virtual nodes, running on 400 PlanetLab nodes. This 
number of virtual nodes is large enough that, with a rout- 
ing table size of 200 entries per social link, most requests 
cannot be served from local tables. Each node continu- 
ously performed lookups on randomly-chosen keys. 

We simulated node churn by inducing node failure and 
recovery events according to a Poisson process. These 
events occurred at an average rate of two per second, but 
we varied the average node downtime. At any given time, 
approximately 10% or 20% of the virtual nodes were of- 
fline. (In addition to simulating 10% and 20% failures, 
we simulated an instance without churn as a control.) We 
expected lookup latency to increase over time as some 
finger nodes became unavailable and some lookups re- 
quired multiple retries. We also expected latency to go 
down whenever SETUP was re-run, building new routing 
tables to reflect the current state of the network. 
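
A churn schedule of this kind can be sketched with exponentially distributed gaps between events. The helper names are illustrative, and the downtime calculation rests on two assumptions not stated in the text: that half of the two events per second are failures, and that Little's law (average offline count = failure rate × mean downtime) describes the steady state.

```python
import random

def poisson_event_times(rate, duration, seed=0):
    """Event times of a Poisson process with the given rate, up to duration."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)      # exponential gap between events
        if t >= duration:
            return times
        times.append(t)

def mean_downtime_for(offline_fraction, n_nodes, failure_rate):
    """Mean downtime that keeps the given fraction offline (Little's law)."""
    return offline_fraction * n_nodes / failure_rate

events = poisson_event_times(rate=2.0, duration=1000.0)
print(len(events), mean_downtime_for(0.10, 4000, 1.0))
```

Under these assumptions, keeping 10% of 4000 virtual nodes offline at one failure per second implies a mean downtime of roughly 400 seconds.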

Figure 13 plots the lookup latency and retries for these 
experiments, and shows that Whanau is largely insensi- 
tive to modest node churn. The median latency is approx- 
imately a single network roundtrip within PlanetLab, and 
increases gradually as churn increases. As expected, the 
fraction of requests needing to be retried increased with 
time when node churn was present, but running SETUP 
restored it to the baseline. 

While this experiment’s scale is too small to test Wha- 
nau’s asymptotic behavior, it demonstrates two points: 
(1) Whanau functions on PlanetLab, and (2) Whanau’s 
simple approach for maintaining routing tables is suffi- 
cient to handle reasonable levels of churn. 


[Figure 13 plots: median latency (ms) and percent of queries retried, each versus time (seconds, 0 to 1400).] 

Figure 13: Lookup latency and fraction of lookups which re- 
quired retries on PlanetLab under various levels of node churn. 
Vertical lines indicate when SETUP installed new routing ta- 
bles. Under churn, the retry frequency slowly increases until 
SETUP runs again, at which point it reverts to the baseline. 


10 Discussion 


This section discusses some engineering details and sug- 
gests some improvements that we plan to explore in fu- 
ture work. 


Systolic mixing process. Most of Whanau’s band- 
width is used to explore random walks. Therefore, it 
makes sense to optimize this part of the protocol. Using 
a recursive or iterative RPC to compute a random walk, 
as suggested by Figure 4, is not very efficient: it uses w 
messages per random node returned. 

A better approach, implemented in our PlanetLab ex- 
periment, is to batch-compute r walks at once. Suppose 
that every node maintains a pool of r addresses of other 
nodes; the pools start out containing r copies of the 
node’s own address. At each time step, each node ran- 
domly shuffles its pool and divides it equally amongst its 
social neighbors. For the next time step, the node com- 
bines the messages it received from each of its neighbors 
to create a new pool, and repeats. After w such mixing 
steps, each node’s pool is a randomly shuffled assortment 
of addresses. If r is sufficiently large, this process ap- 
proximates sending out r random walks from each node. 
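
The batch process just described can be sketched in a few lines. This is a single-machine simulation under stated assumptions, not the PlanetLab implementation: message passing is modeled with an in-memory inbox, and when a pool does not divide evenly among neighbors the remainder is dealt round-robin from a random starting neighbor.

```python
import random
from collections import Counter

def systolic_mix(adj, r, w, seed=0):
    """Simulate w mixing steps; adj maps each node to a list of neighbors."""
    rng = random.Random(seed)
    pools = {v: [v] * r for v in adj}       # r copies of the node's own address
    for _ in range(w):
        inbox = {v: [] for v in adj}
        for v, pool in pools.items():
            rng.shuffle(pool)
            nbs = adj[v]
            offset = rng.randrange(len(nbs))    # avoid favoring one neighbor
            for i, addr in enumerate(pool):
                inbox[nbs[(i + offset) % len(nbs)]].append(addr)
        pools = inbox                       # received addresses form the new pool
    return pools

ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
pools = systolic_mix(ring, r=4, w=5)
print(Counter(a for pool in pools.values() for a in pool))
```

Each step costs one message per social link regardless of r, so the cost per returned address drops from w messages to roughly w/r of a message's worth of traffic per link.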


Very many or very few keys per node. The protocol 
described in this paper handles 1 ≤ k ≤ m well, where 
k is the number of keys per honest node. The extreme 
cases outside this range will require tweaks to Whanau 
to handle them. Consider the case k > m. Any DHT 
requires at least k = Ω(m) resources per node just to 
transmit and store the keys. This makes the task easier, 
since we could use O(m) bandwidth to collect a nearly- 
complete list of all other honest nodes on each honest 
node. With such a list, the task of distributing successor 
records is a simple variation of consistent hashing [13]. 

The analysis in Section 7.1 breaks down for k > m: 
more than m calls to SUCCESSORS-SAMPLE will tend 
to have many repeats, and thus can’t be treated as independent 
trials. To recover this property, we can treat each 
node as k/m virtual nodes, as we did with node degrees. 

Now consider the other extreme: k < 1, i.e., only some 
nodes are storing key-value records in the system. The 
extreme limiting case is only a single honest node storing 
a key-value record into the system, i.e., k = 1/m. 
Whanau can be modified to handle the case k < 1 by 
adopting the systolic mixing process described above and 
omitting empty random walks. This reduces to flooding 
in the extreme case, and smoothly adapts to larger k. 


Handling key churn. It is clear that more bandwidth 
usage can be traded off against responsiveness to churn: 
for example, running SETUP twice as often will result in 
half the latency from key insertion to key visibility. Using 
the observation that the DHT capacity scales with the ta- 
ble size squared, we can improve this bandwidth-latency 
tradeoff. Consider running SETUP every T seconds with 
R resources, yielding a capacity of K = O(R²) keys. 
Compare with this alternative: run SETUP every T/2 
seconds using R/2 resources, and save the last four instances 
of the routing tables. Each instance will have capacity 
K/4, but since we saved four instances, the total 
capacity remains the same. The total resource usage per 
unit time also remains the same, but the responsiveness 
to churn doubles, since SETUP runs twice as often. 
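
The arithmetic behind this tradeoff can be checked directly. The sketch below takes capacity to be exactly the table size squared (dropping the constant hidden in the O notation) and generalizes the factor of two to an arbitrary `split`; both are assumptions made for illustration.

```python
def setup_plan(R, T, split=1):
    """Run SETUP every T/split seconds with R/split resources,
    keeping the last split**2 routing-table instances."""
    per_run = R / split
    period = T / split
    instances = split ** 2
    return {
        "period": period,
        "instances": instances,
        "capacity": instances * per_run ** 2,   # capacity ~ (table size)^2
        "resource_rate": per_run / period,      # resources spent per second
    }

print(setup_plan(1000, 600))
print(setup_plan(1000, 600, split=2))
```

With split=2 the period halves while total capacity and resource rate are unchanged, at the price of four instances to consult on each lookup.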

This scaling trick might seem to be getting “some- 
thing for nothing”. Indeed, there is a price: the number of 
lookup messages required will increase with the number 
of saved instances. However, we believe it may be pos- 
sible to extend Whanau so that multiple instances can be 
combined into a single larger routing table, saving both 
storage space and lookup time. 


11 Summary 


This paper presents the first efficient DHT routing pro- 
tocol which is secure against powerful denial-of-service 
attacks from an adversary able to create unlimited 
pseudonyms. Whanau combines previous ideas — ran- 
dom walks on fast-mixing social networks — with the 
idea of layered identifiers. We have proved that lookups 
complete in constant time, and that the size of routing 
tables is only O(√(km) log km) entries per node for an 
aggregate system capacity of km keys. Simulations of 
an aggressive clustering attack, using social networks 
from Flickr, LiveJournal, YouTube, and DBLP, show that 
when the number of attack edges is less than 10% of the 
number of honest nodes and the routing table size is √m, 
most lookups succeed in only a few messages. Thus, the 
Whanau protocol performs similarly to insecure one-hop 
DHTs, but is strongly resistant to Sybil attacks. 


Abstract 


Early web content was expressed statically, making it 
amenable to straightforward prefetching to reduce user- 
perceived network delay. In contrast, today’s rich web 
applications often hide content behind JavaScript event 
handlers, confounding static prefetching techniques. So- 
phisticated applications use custom code to prefetch data 
and do other anticipatory processing, but these custom 
solutions are costly to develop and application-specific. 

This paper introduces Crom, a generic JavaScript 
speculation engine that greatly simplifies the task of 
writing low-latency, rich web applications. Crom takes 
preexisting, non-speculative event handlers and cre- 
ates speculative versions, running them in a cloned 
browser context. If the user generates a speculated-upon 
event, Crom commits the precomputed result to the real 
browser context. Since Crom is written in JavaScript, it 
runs on unmodified client browsers. Using experiments 
with speculative versions of real applications, we show 
that pre-commit speculation overhead easily fits within 
user think time. We also show that speculatively fetching 
page data and precomputing its layout can make subse- 
quent page loads an order of magnitude faster. 


1 Introduction 


With the advent of web browsing, humans began a 
new era of waiting for slow networks. To reduce user- 
perceived download latencies, researchers devised ways 
for browsers to prefetch content and hide the fetch de- 
lay within users’ “think time” [4, 15, 17, 20, 23]. Find- 
ing prefetchable objects was straightforward because 
the early web was essentially a graph of static ob- 
jects stitched together by declarative links. To discover 
prefetchable data, one merely had to traverse these links. 

In the web’s second decade, static content graphs 
have been steadily replaced by rich Internet applica- 
tions (RIAs) that mimic the interactivity of desktop ap- 
plications. RIAs manipulate complex, time-dependent 
server-side resources, so their content graphs are dy- 
namic. RIAs also use client-side code to enhance inter- 
activity. This eliminates the declarative representation 
of the content graph’s edges, since now content can be 
dynamically named and fetched in response to the exe- 
cution of an imperative event handler. 


1.1 New Challenges to Latency Reduction 


RIAs introduce three impediments to reducing user- 
perceived browser latencies. First, prefetching opportu- 
nities that once were statically enumerable are now hid- 
den behind imperative code such as event handlers. Since 
event handlers have side effects that modify application 
state, they cannot simply be executed “early” to trigger 
object fetches and warm the browser cache. 

Second, user inputs play a key role in naming the con- 
tent to fetch. These inputs may be as simple as the click- 
ing of a button, or as unconstrained as the entry of arbi- 
trary text into a search form. Given the potentially com- 
binatorial number of objects that are nameable by future 
user inputs, a prefetcher must identify a promising subset 
of these objects that are likely to be requested soon. 

Third, RIAs spend a non-trivial amount of time updat- 
ing the screen. Once the browser has fetched the neces- 
sary objects, it must devise a layout tree for those objects 
and render the tree on the display. For modern, graphi- 
cally intensive applications, screen updates can take hun- 
dreds of milliseconds and consume 40% of the processor 
cycles used by the browser [22]. Screen updates con- 
tribute less to page load latencies than network delays do, 
but they are definitely noticeable to users. Unfortunately, 
warming the browser cache before a page is loaded will 
not reduce its layout or rendering cost. 


1.2 Prior Solutions 


To address these challenges, some RIAs use custom 
code to speculate on user intent. For example, email 
clients may prefetch the bodies of recently arrived mes- 
sages, or speculatively upload attachments for emails 
that have not yet been sent. Photo gallery applications 
often prefetch large photos. Online maps speculatively 
download new map tiles that may be needed soon. The 
results page for a web search may prefetch the highest 
ranked targets to reduce their user-perceived load time. 
While such speculative code provides the desired latency 
reductions, it is often difficult to write and tightly inte- 
grated into an application’s code, making it impossible 
to share across applications. 


1.3 Our Solution: Crom 


To ease the creation of low-latency web applications, 
we built Crom, a reusable framework for speculative 
JavaScript execution. In the simplest case, the Crom API 
allows applications to mark individual JavaScript event 
handlers as speculable. For each such handler h, Crom 
creates a new version h_shadow which is semantically 
equivalent to h but which updates a shadow copy of the 
browser state. During user think-time, e.g., when the 
user is looking at a newly loaded page, Crom makes a 
shadow copy of the browser state and runs h_shadow 
in that context. As h_shadow runs, it fetches data, up- 
dates the shadow browser display, and modifies the appli- 
cation’s shadow JavaScript state. Once h_shadow fin- 
ishes, Crom stores the updated shadow context; this con- 
text includes the modified JavaScript state as well as the 
new screen layout. Later, if the user actually generates 
the speculated-upon event, Crom commits the shadow 
context. In most cases, the commit operation only re- 
quires a few pointer swaps and a screen redraw. This is 
much faster than synchronously fetching the web data, 
calculating a new layout, and rendering it on the screen. 

To constrain the speculation space for arbitrarily- 
valued input elements, applications use mutator func- 
tions. Given the current state of the application, a mu- 
tator generates probable outcomes for an input element. 
For example, an application may provide an autocom- 
pleting text box which displays suggested words as the 
user types. Once a user has typed a few letters, e.g., 
“red”, the application passes Crom a mutator function 
which modifies fresh shadow domains to represent 
appropriate speculations; in this example, the mutator may 
set one shadow domain’s text box to “red sox”, and 
another domain’s text box to “red cross”. Later, when 
the user generates an actual event for the input, Crom 
uses application-defined equivalence classes to determine 
whether any speculative domain is the outcome for 
the actual event and thus appropriate to commit. 

The Crom API contains additional functionality to 
support speculative execution. For example, it provides 
an explicit AJAX cache to store prefetched data that the 
regular browser cache would ignore. It also provides a 
server-side component that makes it easier to specula- 
tively upload client data. Taken as a whole, the Crom li- 
brary makes it easier for developers to reason about asyn- 
chronous speculative computations. 


1.4 Our Contributions 


This paper makes the following contributions: 

• We identify four sources of delay in rich web 
applications, explaining how they can be ameliorated 
through speculative execution (§4). 

• We describe an implementation of the Crom API 
which is written in standard JavaScript and runs on 
unmodified browsers (§5). The library, which is 
65 KB in size, dynamically creates new browser 
contexts, rewrites event handlers to speculatively 
execute inside of these contexts, and commits them 
when appropriate. 

• We describe three implementation optimizations 
that mitigate JavaScript-specific performance 
limitations on speculative execution (§5). 

• Using these optimizations, we demonstrate the 
feasibility of browser speculation by measuring the 
performance of three modified applications that use 
the Crom library. We show that Crom’s pre-commit 
speculation overhead is no worse than 114 ms, 
making it feasible to hide speculative computation 
within user think time. We also quantify Crom’s 
reduction of user-perceived latency, showing that 
Crom can reduce load times by an order of 
magnitude under realistic network conditions (§6). 

By automating the low-level tasks that support specula- 
tive execution, Crom greatly reduces the implementation 
effort needed to write low-latency web applications. 


2 Background: Client-side Scripting 


The language most widely used for client-side script- 
ing is JavaScript [7], a dynamically typed, object- 
oriented language. JavaScript interacts with the browser 
through the Document Object Model (DOM) [24], a stan- 
dard, browser-neutral interface for examining and ma- 
nipulating the content of a web page. Each element in 
a page’s HTML has a corresponding object in the DOM 
tree. This tree is a property of the document object 
in the global window name space. The browser ex- 
poses the DOM tree as a JavaScript data structure, allow- 
ing client-side applications to manipulate the web page 
by examining and modifying the properties of DOM 
JavaScript objects, often referred to as DOM nodes. 
The DOM allows JavaScript code to find specific DOM 
nodes, create new DOM nodes, and change the parent- 
child relationships among DOM nodes. 

A JavaScript programmer can create other objects be- 
sides DOM nodes. These are called application heap 
objects. All non-primitive objects, including functions, 
are essentially dictionaries that map property names to 
property values. A property value is either another object 
or a primitive such as a number or boolean. Properties 
may be dynamically added to, and deleted from, an 
object. The for-in construct allows code to iterate over 
an object’s property names at run-time. 
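
As a small illustration of this dictionary-like behavior, the sketch below uses a hypothetical message object and handler function; the names are ours and do not come from any application discussed in this paper.

```javascript
// Objects behave like dictionaries from property names to values.
const message = { subject: "Lunch?", unread: true };

message.sender = "alice@example.com"; // add a property at run-time
delete message.unread;                // delete a property at run-time

// for-in visits the enumerable property names that remain.
const names = [];
for (const name in message) {
  names.push(name);
}

// Functions are objects too, so they can carry properties.
function onClick() {}
onClick.label = "open message";
```

This run-time enumeration of properties is what lets a library walk and copy objects whose shape it does not know in advance.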

Built-in JavaScript objects like a String or a DOM 
node have native code implementations: their methods 
are executed by the browser in a way that cannot be 
introspected by application-level JavaScript. In contrast, 
JavaScript can fetch the source code of a user-defined 
method by calling its toString() method. 
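
The difference between user-defined and native functions can be observed directly. The following sketch uses a hypothetical greet() function of our own; the exact text returned for a native method varies by engine, but it contains a native-code placeholder rather than real source.

```javascript
// A user-defined function exposes its source text...
function greet(name) {
  return "Hello, " + name;
}
const userSource = greet.toString();

// ...while a built-in method only reports a native-code placeholder.
const nativeSource = Array.prototype.push.toString();
```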

JavaScript uses event handlers to make web pages in- 
teractive. An event handler is a function assigned to a 
special property of a DOM node; the browser will 
invoke that property when the associated event occurs.¹ 
For example, when a user clicks a button, the browser 
will invoke the onclick property of that button’s DOM 
object. This gives application code an opportunity to 
update the page in response to user activity. 

The AJAX interface [9] allows a JavaScript 
application to asynchronously fetch web content. AJAX 
is useful because JavaScript programs are 
single-threaded and network operations can be slow. To 
issue an AJAX request, JavaScript code creates an 
XMLHttpRequest object and assigns an event handler 
to its onreadystatechange property. The browser 
will call this handler when reply data arrives. 
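
A minimal sketch of this request pattern follows. The fetchAsync() helper and its createXhr parameter are our own assumptions: createXhr is a hypothetical injection point that lets the sketch run outside a browser, and in a browser it can simply be omitted.

```javascript
// Issue an asynchronous GET and hand the reply text to onData.
// createXhr is a hypothetical hook for supplying a stand-in request object.
function fetchAsync(url, onData, createXhr) {
  const xhr = createXhr ? createXhr() : new XMLHttpRequest();
  xhr.onreadystatechange = function () {
    // readyState 4 (DONE) with status 200 means the full reply arrived.
    if (xhr.readyState === 4 && xhr.status === 200) {
      onData(xhr.responseText);
    }
  };
  xhr.open("GET", url, true); // third argument true: asynchronous
  xhr.send(null);
  return xhr;
}
```

Because the handler runs later, on the same single thread, the calling code returns immediately and the page stays responsive while the fetch is in flight.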


3 Design 


Crom’s goal is to reduce the developer effort needed to 
create speculative applications. In particular, Crom tries 
to minimize the amount of custom code that develop- 
ers must write to speculate on user inputs like keyboard 
and mouse activity. Crom’s API leverages the fact that 
event-driven applications already have a natural gram- 
mar for expressing speculable user actions—each action 
can be represented as an event handler and a particular 
set of arguments to pass to the handler. Using the Crom 
API, applications can mark certain actions as specula- 
ble. Crom will then automatically perform the low-level 
tasks needed to run the speculative code paths, isolate 
their side-effects, and commit their results if appropriate. 

Crom provides an application-agnostic framework for 
speculative execution. This generality allows a wide 
variety of programs to benefit from Crom’s services. 
However, as a consequence of this generality, Crom requires 
some developer guidance to ensure correctness and to 
prevent speculative activity from consuming too many 
resources. In particular: 

• Speculating on all possible user inputs is 
computationally infeasible. Thus, Crom relies on the 
developer to constrain the speculation space and suggest 
reasonable speculative inputs for a particular 
application state (§4.1.1 and §5.1.3). 

• Speculative execution should only occur when the 
application has idle resources, e.g., during user 
think time. Crom’s speculations are explicitly 
initiated by the developer, and Crom trusts the 
developer to only issue speculations when resources 
would otherwise lie fallow. 

• A speculative context should only be committed if 
it represents a realizable outcome for the current 
application state. In particular, the initial state of 
a committing speculative context must have been 
equal to the current state of the application. This 
guarantees that the speculative event handler did the 
same things that its non-speculative version would 
do if executed now. We call this safety principle 
start-state equivalence. Crom could automatically 
check for this in an application-agnostic way by 
bit-comparing the current browser state with the 
initial state for the speculative context. However, 
performing such a comparison would often be 
expensive. Thus, Crom requires the developer to leverage 
application knowledge and define an equivalence 
function that determines whether a speculative 
context is appropriate to commit for a given application 
state (§4.1.1 and §5.1.3). 

• A client-side event handler may generate writes 
to client-side state and server-side state. The 
developer must ensure that client-side speculation is 
read-only with respect to server state, or that 
server-side updates issued by speculative code can be 
undone, e.g., by resetting the “message read” flags on 
an email server (§4.1.2). 


¹This description applies to the DOM Level 0 event model. The 
DOM Level 2 model is also common, but it has incompatible semantics 
across browsers. We do not discuss it in this paper. 

Writing speculative code is inherently challenging. 
Crom hides some of the implementation complexity, but 
it does not completely free the developer from reason- 
ing about speculative operations. We believe that Crom 
strikes the appropriate balance between the competing 
tensions of correctness, ease of use, and performance. 

Crom ensures correctness by rewriting speculative 
code to only touch shadow state, and by using developer- 
defined equivalence functions to determine commit 
safety. Crom provides ease of use through its generic 
speculation API. With respect to performance, Crom’s 
goal is not to be as CPU-efficient as custom speculative 
code. Indeed, Crom’s speculative call trees will gener- 
ally be slower than hand-crafted speculation code. Such 
custom code has no rewriting or cloning overhead, and 
it can speculate in a targeted way, eliding the code in an 
event handler call chain that is irrelevant to, say, warming 
a cache. However, Crom’s speculations only need to be 
“fast enough”—they must fit within user think time, be 
quick with respect to network latencies, and not disturb 
foreground computations. If these conditions are satis- 
fied, Crom’s computational inefficiency relative to cus- 
tom speculation code will be moot, and Crom’s specula- 
tions will mask essentially as much network latency as 
hand-coded speculations would. 

In addition to processor cycles, speculative activity re- 
quires network bandwidth to exchange data with servers. 
Crom does not seek to reduce this inherent cost of spec- 
ulation. As with hand-crafted speculative solutions, de- 
velopers must be mindful of network overheads and be 
judicious in how many Crom speculations are issued. 

Figure 1 lists the primary Crom API. We discuss this 
API in greater detail in the next two sections. Most of 
the technical challenges lie with the implementation of 
Crom.makeSpeculative(), which allows an 
application to define speculable user actions. 

Crom.makeSpeculative(DOMnode, eventName, mutator, 
        mutatorArgs, stateSketch, DOMsubtree) 
    Register DOMnode.eventName as a speculable 
    event handler (§4.1 and §5.1). The mutator, 
    mutatorArgs, and stateSketch arguments 
    constrain the speculation space and define the 
    equivalence classes for commits (§4.1.1 and §5.1.3). 
    DOMsubtree defines a speculation zone (§5.5.2). 

Crom.autoSpeculate 
    Boolean which determines whether Crom should 
    automatically respeculate after committing a prior 
    speculation (§4.1.1). 

Crom.forceSpeculations() 
    If autoSpeculate is false, this method forces 
    Crom to issue pending speculations. 

Crom.maxSpeculations(N) 
    Limits number of speculations Crom issues (§5.6). 

Crom.createContextPool(N, stateSketch, DOMsubtree) 
    Proactively make N speculative copies of the current 
    browser context; tag them using the stateSketch 
    function (§5.5.3). 

Crom.rewriteCG(f) 
    Rewrite a closure-generating function to make it 
    amenable to speculation (§5.1.5). 

Crom.prePOST(formInput, specUploadDoneCallback) 
    Collaborates with Crom’s server-side component to 
    make a form element speculatively upload data (§4.2 
    and §5.4). 

Crom.cacheAdd(key, AJAXdata) 
Crom.cacheGet(key) 
    Used to cache speculatively fetched AJAX data that 
    would otherwise be ignored by the regular browser 
    cache (§4.1.2 and §5.3). 


Figure 1: Crom API calls and configuration settings. 


4 Speculation Opportunities & Techniques 


In this section, we describe four ways that 
speculative execution can reduce latency in rich Internet 
applications. We also explain how to leverage the Crom API 
from Figure 1 to exploit these speculative opportunities. 


4.1 Simple Prefetching 


When a user triggers a heavyweight state change in 
a RIA, the browser does four things. First, it executes 
JavaScript code associated with an event handler. In turn, 
this code fetches data from the local browser cache or 
external web servers. Once the content is fetched, the 
browser determines the new display layout for the page. 
Finally, the browser draws the content on the screen. 

Pulling data across the network is typically the slow- 
est task, so warming the browser cache by specula- 
tively prefetching data can dramatically improve user- 
perceived latencies. For example, a photo gallery or 
an interactive map application can prefetch images to 
avoid synchronous fetches through a high-latency or low- 
bandwidth network connection. As another example, 
consider the DHTMLGoodies tab manager [12], which 
allows applications to create tab pane GUIs. When the 
user clicks a “new tab” link, the tab manager issues an 
AJAX request to fetch the new tab’s content. When the 
AJAX request completes, a callback dynamically 
creates a new display tab and inserts the returned HTML 
into the DOM tree. Figure 2 demonstrates how to make 
this operation speculative. When the application calls 
Crom.makeSpeculative(), Crom automatically 
makes a shadow copy of the current browser state and 
rewrites the onclick event handler, creating a 
speculative version that accesses shadow state. When the 
application calls Crom.forceSpeculations(), Crom 
runs the rewritten handler in the hidden context, 
fetching the AJAX data and warming the real browser cache. 
Once the browser cache has been warmed, Crom can 
discard the speculative context. However, saving the 
context for future committing provides greater 
reductions in user-perceived fetch latencies (§4.3). 


    <div id='tab-container'> 
      <div class='dhtmlgoodies_aTab'> 
        This is the initial tab. 
        <a href='#' id='loadLink'> 
          Click to load new tab. 
        </a> 
      </div> 
    </div> 

    <script> 
    var link = document.getElementById('loadLink'); 
    link.onclick = function(){ 
        //Invoke DHTMLGoodies API to make new tab 
        createNewTab('http://www.foo.com'); 
    }; 
    Crom.makeSpeculative(link, 'onclick'); 
    Crom.forceSpeculations(); 
    </script> 


Figure 2: Creating a new tab GUI element. 


4.1.1 Speculating on Multi-valued Inputs 


In the previous example, an application speculated on 
an input element with one possible outcome, i.e., being 
clicked. Other input types can generate multiple 
speculable outcomes. For example, a list selector allows one 
of several options to be chosen, and a text box can accept 
arbitrary character inputs. 


To speculate on a multi-valued input, an applica- 
tion passes a mutator function, a state sketch func- 
tion, and a vector of N mutator argument sets 
to Crom.makeSpeculative(). Crom makes N 
copies of the current browser state, and for each argu- 
ment set, Crom runs the rewritten mutator with that ar- 
gument set in the context of a shadow browser environ- 
ment. This generates N distinct contexts, each represent- 
ing a different speculative outcome for the input element. 
Crom passes each context to the sketch function, gener- 
ating an application-defined signature string for that con- 
text. Crom tags each context with its unique sketch and 
then runs the speculative event handler in each context, 
saving the modified domains. Later, when the user actu- 
ally generates an event, Crom determines the sketch for 
the current, non-speculative application state. If Crom 
has a speculative context with a matching tag, Crom can 
safely commit that context due to start-state equivalence. 
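
The steps above can be sketched with plain objects standing in for browser contexts. This is a simplified model under our own assumptions, not Crom's implementation: makeSpeculations() and tryCommit() are hypothetical names, contexts are copied with a JSON round-trip rather than by cloning browser state, and the mutator receives its target context explicitly instead of being rewritten.

```javascript
// Build one speculative context per mutator argument, keyed by sketch tag.
function makeSpeculations(liveCtx, handler, mutator, mutatorArgs, sketch) {
  const table = new Map(); // sketch tag -> finished speculative context
  for (const arg of mutatorArgs) {
    const ctx = JSON.parse(JSON.stringify(liveCtx)); // shadow copy
    mutator(ctx, arg);       // move the copy to a speculated input state
    const tag = sketch(ctx); // tag is computed before the handler runs
    handler(ctx);            // speculative handler touches only the copy
    table.set(tag, ctx);
  }
  return table;
}

// Commit a saved context only if its tag matches the live sketch.
function tryCommit(app, table, sketch) {
  const ready = table.get(sketch(app.ctx));
  if (ready === undefined) return false;
  app.ctx = ready; // in this model, commit is a reference swap
  return true;
}

// Usage: speculate on two completions of "red", then the user picks one.
const app = { ctx: { text: "red", results: null } };
const sketch = (ctx) => ctx.text;
const mutator = (ctx, hint) => { ctx.text = hint; };
const onSearch = (ctx) => { ctx.results = "results for " + ctx.text; };
const table = makeSpeculations(
  app.ctx, onSearch, mutator, ["red sox", "red cross"], sketch);

app.ctx.text = "red sox"; // the real user input
const committed = tryCommit(app, table, sketch);
```

Tagging each context before the handler runs is what ties the commit decision to the context's start state rather than to its speculative result.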


Figure 3 shows an example of how Crom specu- 
lates on multi-valued inputs. In this example, we 
use the autocompleting text box from the popular 
script.aculo.us JavaScript library [21]. The first two 
lines of HTML define the text input and the but- 
ton which triggers a data fetch based on the text 
string. We create a custom JavaScript object called 
acManager to control the autocompletion process and 
register it with the script.aculo.us library. The library in- 
vokes acManager.customSelector() whenever 
the user generates a new input character. Once the user 
has typed five characters, acManager uses AJAX to 
speculatively fetch the data associated with each 
suggested autocompletion. The state sketch function simply 
returns the value of the search text; a speculative 
context is committable if its search text is equivalent to that 
of the real domain. 


Note that the code sets Crom.autoSpeculate to 
false, indicating that Crom should not automatically 
respeculate on the handler when a prior speculation for 
the handler commits. The autocompletion logic explic- 
itly forces speculation when the user has typed enough 
text to generate completion hints. 

    <input id='sText' type='text' /> 
    <button id='sButton'>Search</button> 
    <script> 
    var sText = document.getElementById('sText'); 
    var sButton = document.getElementById('sButton'); 
    sButton.onclick = function(){ 
        updateDisplayWithAJAXdata(sText.value); 
    }; 

    Crom.autoSpeculate = false; 
    function mutator(arg){ //Set value of text input 
        sText.value = arg; 
    } 
    function sketch(ctx){ 
        var doc = ctx.document; 
        var textInput = doc.getElementById('sText'); 
        return textInput.value; 
    } 
    var acManager = { 
        getHints: function(textSoFar){ 
            //Logic for generating autocompletions; 
            //returns an array of strings 
        }, 
        customSelector: function(textSoFar){ 
            if(textSoFar.length == 5){ 
                var hints = this.getHints(textSoFar); 
                Crom.makeSpeculative(sButton, 
                    'onclick', mutator, hints, sketch); 
                Crom.forceSpeculations(); 
            } 
            //Display autocompletions to user... 
        } 
    }; 
    new Autocompleter('sText', acManager); 
    </script> 


Figure 3: Speculating on an autocompletion event. 


4.1.2 Separating Reads and Updates 


In some web applications, pulling data from a server 
has side effects on the client and the server. In these sit- 
uations, speculative computations must not disturb fore- 
ground state on either host. For example, the Decimail 
webmail client [5] uses AJAX to wrap calls to an IMAP 
server. The fetchNewMessage() operation updates 
client-side metadata (e.g., a list of which messages have 
been fetched) and server-side metadata (e.g., which mes- 
sages should be marked as seen). 


To speculate on such a read/write operation, the de- 
veloper must explicitly decompose it into a read portion 
and a write portion with respect to server state. For ex- 
ample, to add Crom speculations to the Decimail client, 
we had to split the preexisting fetchNewMessage() 
operation into a read-only downloadMessage() 
and a metadata-writing markMessageRead(). The 
read-only operation downloads an email from the 
server, but specifies in the IMAP request that the 
server should not mark the message as seen. The 
markMessageRead() tells the server to update this 
flag, effectively committing the message fetch on the 
server-side. Inside fetchNewMessage(), the call 
to markMessageRead() is conditioned on whether 
fetchNewMessage() is running speculatively; code 
can check whether this is true by reading the special 
Crom.isSpecEx boolean. 


    <form action='server.com/recv.py'> 
      <div> 
        <label>File 1:</label> 
        <input type='file' id='fInput'/> 
      </div> 
      <div> 
        <input type='submit' value='Send data!'/> 
      </div> 
    </form> 

    <script> 
    var fInput = document.getElementById('fInput'); 
    Crom.speculativePOST(fInput, 
        function(){alert('File uploaded!')}); 
    </script> 


Figure 4: Making a POST operation speculative. 


Although downloadMessage() may be read-only with 
respect to the server, it may update client-side 
JavaScript state. So, when speculating on 
fetchNewMessage(), we run downloadMessage() in a 
speculative execution context. In speculative or 
non-speculative mode, downloadMessage() places AJAX 
responses into a new cache provided by Crom. Later, when 
downloadMessage() runs in non-speculative mode, 
it checks this cache for the message and avoids a refetch 
from the server. 
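
The read/write split and the cache check can be sketched as follows. Everything here is a stand-in of our own: the Map plays the role of Crom's AJAX cache (Crom.cacheAdd and Crom.cacheGet), the speculating flag plays the role of Crom.isSpecEx, and the server object is a hypothetical in-memory substitute for the IMAP server.

```javascript
const ajaxCache = new Map(); // stand-in for Crom's AJAX cache
let speculating = false;     // stand-in for the Crom.isSpecEx boolean

// Hypothetical in-memory mail server.
const server = {
  seen: new Set(),
  downloads: 0,
  peek(id) { this.downloads += 1; return "body of " + id; }, // read-only
  markSeen(id) { this.seen.add(id); }                        // write
};

// Read portion: fetch without marking the message as seen.
function downloadMessage(id) {
  if (ajaxCache.has(id)) return ajaxCache.get(id); // avoid a refetch
  const body = server.peek(id);
  ajaxCache.set(id, body);
  return body;
}

// Write portion: commits the fetch on the server side.
function markMessageRead(id) { server.markSeen(id); }

function fetchNewMessage(id) {
  const body = downloadMessage(id);
  if (!speculating) markMessageRead(id); // skipped while speculating
  return body;
}

function runSpeculatively(fn) {
  speculating = true;
  try { return fn(); } finally { speculating = false; }
}

// Usage: speculate first, then perform the real fetch.
runSpeculatively(() => fetchNewMessage("msg-1"));
const seenAfterSpeculation = server.seen.has("msg-1");
const body = fetchNewMessage("msg-1");
```

In this model the speculative run leaves server state untouched, and the later real fetch is served from the cache and only issues the metadata write.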


Like the regular browser cache, Crom’s AJAX cache 
persists across speculations (although not application 
reloads). The regular cache will store AJAX results 
containing “expires” or “cache control” headers [6], so 
an application-level AJAX cache may seem superfluous. 
However, some AJAX servers do not provide caching 
headers, making it impossible to rely on the regular cache 
to store AJAX data if the client side of the application is 
developed separately from the server side. Examples of 
such scenarios include mash-ups and aggregation sites. 


4.2 Pre-POSTing Uploads 


Prefetching allows Crom to hide download latency. 
However, in some situations, such as Decimail’s attach- 
file function, the user is stalled by upload (HTTP POST) 
delays. To hide this latency, Crom’s client and server 
components cooperate to create a POST cache. When 
a user specifies a file to send, Crom speculates that the 
user will later commit the send. Crom asynchronously 
transfers the data to the server’s POST cache. Later, if 
the user commits the send, the asynchronous POST will 
be finished (or at least already in-progress). 

Figure 4 demonstrates how to make a POST operation 
speculative. The web application simply registers the rel- 
evant input element with the Crom library. Once the 
user has selected a file, Crom automatically starts 
uploading it to the server. When the speculative upload 
completes, Crom invokes an optionally provided 
callback function; this allows the application to update 
foreground (i.e., non-speculative) GUI state to indicate that 
the file has safely reached the server. 

Speculative uploading is not a new technique, and it 
is used by several popular services like GMail. Crom’s 
contribution is providing a generic framework for adding 
speculative uploads to non-speculative applications. 


4.3 Saving Client Computation 


When an application updates the screen, the browser 
uses CPU cycles for layout and rendering. During lay- 
out, the browser traverses the updated DOM tree and de- 
termines the spatial arrangement of the elements. Dur- 
ing rendering, the browser draws the laid-out content on 
the screen. Speculative cache warming can hide fetch 
latency, but it cannot hide layout or rendering delays. 

Crom stores each speculative browser context inside 
an invisible <iframe> tag. As a speculative event han- 
dler executes, it updates the layout of its corresponding 
iframe. When the handler terminates, Crom saves the 
already laid-out iframe. Later, if the user generates 
the speculated-upon event, Crom commits the specula- 
tive DOM tree in the iframe to the live display, paying 
the rendering cost but avoiding the layout cost. The re- 
sult is a visibly smoother page load. 


4.4 Server Load Smoothing 


Some client delays are due not to network delays, but 
to congestion at the server due to spiky client loads. Us- 
ing selective admission control at the server, speculative 
execution spreads client workload across time, just as 
speculation plus differentiated network service smooths 
peak network loads [3]. When the server is idle, specula- 
tions slide requests earlier in time, and when the server is 
busy, speculative requests are rejected and the associated 
load remains later in time. This paper does not explore 
server smoothing further, but the techniques described 
above, together with prioritized admission control at the 
server, should adequately expose this opportunity. 


5 Implementation 


The client-side Crom API could be implemented in- 
side the browser or by a regular JavaScript library. For 
deployability, we chose the latter option. In this sec- 
tion, we describe our library implementation and the op- 
timizations needed to make it performant. 


5.1 Making Event Handlers Speculative 


To create a speculative version of an event 
handler bound to DOM node d, an application calls 
Crom.makeSpeculative(d, eventName); 
Figures 2 and 3 provide sample invocations of this function. 
The makeSpeculative() method does two things. 
First, it creates a shadow browser context for the specula- 
tive computation. Second, it creates a new event handler 
that performs the same computation as the original one, 
but reads and writes from the speculative context instead 
of the real one. We discuss context cloning and function 
rewriting in detail below. 


5.1.1 The Basics of Object Copying 


Crom clones different types of objects using different 
techniques. For primitive values, Crom just returns the 
value. For built-in JavaScript objects like Dates, Crom 
calls the relevant built-in constructor to create a semanti- 
cally equivalent but referentially distinct object. 

JavaScript functions are first-class objects. Calling 
a function’s toString() method returns the 
function’s source code. To clone a function f, Crom calls 
eval(f.toString()), using the built-in eval() 
routine to parse the source and generate a semantically 
equivalent function. Like any object, f may have 
properties. So, after cloning the executable portion of f, Crom 
uses a for-in loop to discover f’s properties, copying 
primitives by value and objects using deep copies. 

To clone a non-function object, Crom creates an 
initially empty object, finds the source object’s properties 
using a for-in loop, and copies them into the target 
object as above. Since object graphs may contain cycles, 
Crom uses standard techniques from garbage collection 
research [11] to ensure that each object is only copied 
once, and that the cloned object graph is isomorphic to 
the real one. 
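
The copying rules above can be sketched in a single routine. This is our own simplified illustration, not Crom's code: deepClone() is a hypothetical name, the visited map is one common way to handle cycles, and the function source is wrapped in parentheses so that eval() parses it as an expression and returns the function. Native functions, closures over local variables, and shorthand methods are not handled.

```javascript
// Deep-copy a value; 'seen' maps each original object to its copy.
function deepClone(value, seen = new Map()) {
  const isObject = typeof value === "object" && value !== null;
  if (!isObject && typeof value !== "function") {
    return value; // primitives are copied by value
  }
  if (seen.has(value)) return seen.get(value); // copy each object once

  let copy;
  if (value instanceof Date) {
    copy = new Date(value.getTime()); // built-in constructor
  } else if (typeof value === "function") {
    // Indirect eval re-parses the source text in the global scope.
    copy = (0, eval)("(" + value.toString() + ")");
  } else {
    copy = Array.isArray(value) ? [] : {};
  }
  seen.set(value, copy); // register before recursing so cycles terminate

  for (const name in value) {
    copy[name] = deepClone(value[name], seen);
  }
  return copy;
}

// Usage: a cyclic graph containing a Date and a function with a property.
const original = {
  when: new Date(0),
  inc: function (n) { return n + 1; }
};
original.inc.tag = "counter";
original.self = original; // cycle
const clone = deepClone(original);
```

Registering the copy in the map before descending into its properties is what keeps the cloned graph isomorphic to the original: the cycle through self resolves to the clone rather than recursing forever.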


To clone a DOM tree with a root node n, Crom calls 
the native DOM method n.cloneNode (true), 
where the boolean parameter indicates that n’s 
DOM children should be recursively copied. The 
cloneNode() method does not copy event handlers 
or other application-defined properties belonging to a 
DOM node. Thus, Crom must copy these properties 
explicitly, traversing the speculative DOM tree in 
parallel with the real one and updating the properties 
for each speculative node. Non-event-handler properties 
are deep-copied using the techniques described above. 
Since Crom rewrites handlers and associates special 
metadata with them, Crom assumes that user-defined 
code does not modify or introspect event handlers. So, 
Crom shallow-copies event handlers by reference. 


5.1.2 Cloning the Entire Browser State 


To clone the whole browser context, Crom first copies 
the real DOM tree. Crom then creates an invisible 
<iframe> tag, installing the cloned DOM tree as the 
root tree of the iframe’s document object. Next, 
Crom copies the application heap, which is defined as 
all JavaScript objects in the global namespace and all 


objects reachable from those roots. Crom discovers the 
global properties using a for-in loop over window. 
Crom deep-copies each of these properties and inserts 
the cloned versions into an initially empty object called 
specContext. specContext will later serve as the 
global namespace for a speculative execution. 


Global properties can be referenced with 
or without the window. prefix. To prevent 
window.globalVar from falling through to 
the real window object, Crom adds a 
property to specContext called window that 
points to specContext. Crom also adds a 
specContext.document property that points 
to the hidden <iframe>’s document object. As we 
explain in Section 5.1.4, this forces DOM operations in 
the speculative execution to touch the speculative DOM 
tree instead of the real one. 
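
A rough sketch of this construction, under our own assumptions, is shown below. makeSpecContext() is a hypothetical helper; plain objects stand in for the real window and for the hidden iframe's document, and a JSON round-trip stands in for Crom's deep copy, so only plain data is handled.

```javascript
// Build a speculative global namespace from a set of real globals.
function makeSpecContext(realGlobals, shadowDocument) {
  const specContext = {};
  for (const name in realGlobals) {
    // JSON round-trip is a stand-in for a full deep copy.
    specContext[name] = JSON.parse(JSON.stringify(realGlobals[name]));
  }
  specContext.window = specContext;      // window.x resolves to the copy
  specContext.document = shadowDocument; // DOM access hits the shadow tree
  return specContext;
}

// Usage: a write through the speculative window leaves real state alone.
const realGlobals = { inbox: { unread: 3 } };
const shadowDocument = { title: "shadow" };
const specContext = makeSpecContext(realGlobals, shadowDocument);
specContext.window.inbox.unread = 0;
```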


5.1.3 Commit Safety & Equivalence Classes 


As described so far, a shadow context is initialized 
to be an exact copy of the browser state at clone time. 
This type of initialization has an important consequence: 
it prevents us from speculating on user intents that are 
not an immediate extension of the current browser state. 
For example, a text input generates an onchange event 
when the user types some characters and then shifts input 
focus to another element. If the text input is empty when 
Crom creates a speculative domain, Crom can specula- 
tively determine what the onchange handler would do 
when confronted with an empty text box. However, if 
Crom creates shadow contexts as exact copies of the cur- 
rent browser state, Crom has no way to speculate on what 
the handler would do if the user had typed, say, “val- 
halla” into the text input. 


To address this problem, we allow  applica- 
tions to provide three additional arguments to 
Crom.makeSpeculative(): a mutator func- 
tion, a mutator argument vector, and a state sketch 
function. Section 4.1.1 provides an overview of these 
parameters. Here, we only elaborate on the sketch 
function and its relationship to committability. 

The sketch function accepts a global namespace, 
speculative or real, and returns a unique string identifying 
the salient application features of that name space. Each 
speculative context is initially a perfect copy of the real 
browser context at time t0. Thus, at t0, before any 
speculative code has run, the new speculative context has the 
same sketch as the real context. Crom tags each 
speculative context with the state sketch of its source context. 
Later, at time t1, when Crom must decide whether the 
speculative context is committable, it calculates the state 
sketch for the real browser context at t1. A speculative 
context is only committable if its sketch tag matches the 
sketch for the current browser context. This ensures that 
the speculative context started as a semantically 
equivalent copy of the current browser state, and therefore 
represents the appropriate result for the user’s new input. 

State sketches provide a convenient way for appli- 
cations to map semantically identical but bit-different 
browser states to a single speculable outcome. For 
example, the equivalence function in Figure 3 could 
canonicalize the search strings blue\tbook and 
blue\t\tbook to the same string. 
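
As an illustration of such canonicalization, the sketch function below collapses runs of whitespace before comparing. The context shape (a searchText field) and the isCommittable() helper are assumptions made for this example only.

```javascript
// Map bit-different but semantically identical search text to one tag.
function sketch(ctx) {
  return ctx.searchText.trim().split(/\s+/).join(" ");
}

// A speculative context is committable only if its saved tag
// matches the sketch of the live context.
function isCommittable(specTag, liveCtx) {
  return specTag === sketch(liveCtx);
}

const specTag = sketch({ searchText: "blue\tbook" });
const liveCtx = { searchText: "blue\t\tbook" };
const ok = isCommittable(specTag, liveCtx);
```

A coarser sketch lets more live states reuse a given speculation, but the application must make sure that states it treats as equivalent really do lead to the same handler result.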


5.1.4 Rewriting Handlers 


After creating a speculative browser context, Crom 
must create a speculative version of the event handler, 
i.e., one that is semantically equivalent to the 
original but which interacts with the speculative context 
instead of the real one. To make such a function, 
Crom employs JavaScript’s with statement. Inside a 
with(obj){...} statement, the properties of obj are 
pushed to the front of the name resolution chain. For 
example, if obj has a property p, then references to p 
touch obj.p rather than a globally defined p. 

To create a speculative version of an event handler, 
Crom fetches the handler’s source code by calling its 
toString() method. Next, Crom alters the source 
code string, placing it inside a with (specContext) 
statement. Finally, Crom uses eval () to generate a 
compiled function object. When Crom executes the new 
handler, each handler reference to a global property will 
be directed to the cloned property in specContext. 
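
The rewriting step can be sketched as follows. This is a minimal illustration, not Crom's actual code: rewriteHandler is a hypothetical name, and the sketch uses the Function constructor in place of eval() so that the generated code is always non-strict (the with statement is not allowed in strict mode).

```javascript
// Hypothetical helper: wrap a handler's source in with (specContext) and
// compile it. Assumption: new Function stands in for eval().
function rewriteHandler(handler, specContext) {
  var source = handler.toString();             // fetch the handler's source
  var factory = new Function(
    "specContext",
    "with (specContext) { return (" + source + "); }"
  );
  return factory(specContext);                 // compiled speculative handler
}

globalThis.clickCount = 0;                     // real global state

function onClick() {
  clickCount = clickCount + 1;                 // refers to a global by name
  return clickCount;
}

var specContext = { clickCount: globalThis.clickCount };  // cloned global
var specOnClick = rewriteHandler(onClick, specContext);
var specResult = specOnClick();
```

After the speculative call, specContext.clickCount has been incremented while the real global clickCount is untouched.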

The with() statement binds lexically, so if 
the original event handler calls other functions, 
Crom must rewrite those as well. Crom does 
this lazily: for every function or method call 
f() inside the original handler, Crom inserts a 
new variable declaration var _rewritten_f = 
Crom.rewriteFunction(f, specContext);, 
and replaces calls to f() with calls to 
_rewritten_f(). 

The document object mediates application access 
to the DOM tree. specContext.document points 
to the shadow DOM tree, so speculative DOM oper- 
ations can only affect speculative DOM state. Since 
document methods do not touch application heap ob- 
jects, Crom does not need to rewrite them. 

The names of function parameters may shadow those 
of global variables. Speculative references to these vari- 
ables should not resolve to specContext. When 
rewriting functions with shadowed globals, Crom passes 
a new speculative scope to with() statements; this 
scope is a copy of specContext that lacks references 
to shadowed globals. 

If speculative code creates a new global property, 
it may slip past the with statement into the real 
global namespace. Fortunately, JavaScript is single- 
threaded, so Crom can check for new globals after the 
speculative handler has finished and sweep them into 
specContext before they are seen by other code. 


    function genClosure(x) {
        function f() { alert(x++); }
        return f;
    }

    var closureFunc = genClosure(0);
    button0.onclick = closureFunc;
    button1.onclick = closureFunc;

Figure 5: Variable x persists in a hidden closure scope. 


    function genClosure(x) {
        var __cIndex = Crom.newClosureId();
        __closureEnvironment[__cIndex].x = x;
        function f() {
            alert(__closureEnvironment[__cIndex].x++);
        }
        f.__cIndex = __cIndex;
        return f;
    }

Figure 6: The rewritten closure scope is explicit. 

When speculative code deletes a global property, 
this only removes the property from specContext. 
When the speculation commits, this property must also 
be deleted from the global namespace. To accomplish 
this, Crom rewrites delete statements to additionally 
collect a list of deleted property names. If the specula- 
tion later commits, Crom removes these properties from 
the real global namespace. 


5.1.5 Externally Shared Closures 


Whenever a function is created, it stores its lexical 
scope in an implicit activation record, using that scope 
for subsequent name resolution. Unfortunately, these ac- 
tivation records are not introspectable. If they escape 
cloning, they become state shared with the real (i.e., non- 
speculative) browser context. To avoid this fate, Crom 
rewrites closures to use explicit activation objects. Later, 
when Crom creates a speculative context, it can clone the 
activation object using its standard techniques. 

Consider the code in Figure 5, which creates a clo- 
sure and makes it the event handler for two different but- 
tons. During non-speculative execution, clicking either 
button updates the same counter inside the shared clo- 
sure. However, closureFunc.toString() merely 
returns “function (){alert(x++);}”, with no mention of 
x’s closure binding. Using the rewriting techniques de- 
scribed so far, a rewritten handler would erroneously 
look for x in the global scope. 

Figure 6 shows genClosure() rewritten to use an 
explicit activation record. Crom.newClosureId() 
creates an empty activation record, pushes it onto a 
global array __closureEnvironments, and returns 
its index. Crom rewrites each property that implicitly 
references the closure scope to explicitly reference the 
activation record via a function property __cIndex. 
Later, when rewriting the button0 or button1 
event handler, Crom detects that the handler is a closure 
by its __cIndex property. If the value of f.__cIndex 
is, say, 2, Crom rewrites the closure function by adding 
the statement var __cIndex = 2; immediately be- 
fore the with (specContext) in the string passed to 
eval(). This gives the rewritten handler enough state 
to access the proper explicit activation record. Like any 
other global, __closureEnvironments is specula- 
tively cloned, so each speculative execution has a private 
snapshot of the state of all closures. 
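
The following sketch shows how an explicit activation record lets a speculation work on a private copy of closure state. It makes several assumptions: the Crom object here is a stand-in, rewriteClosure is a hypothetical helper, the Function constructor replaces eval(), and the closure returns its counter instead of calling alert() so that the result can be inspected.

```javascript
// Global array of explicit activation records.
var __closureEnvironments = [];

// Stand-in for Crom.newClosureId(): push an empty record, return its index.
var Crom = {
  newClosureId: function () {
    __closureEnvironments.push({});
    return __closureEnvironments.length - 1;
  }
};

// genClosure() after rewriting: x lives in an explicit record.
function genClosure(x) {
  var __cIndex = Crom.newClosureId();
  __closureEnvironments[__cIndex].x = x;
  function f() {
    return __closureEnvironments[__cIndex].x++;
  }
  f.__cIndex = __cIndex;
  return f;
}

// Hypothetical speculative rewrite of a closure: re-declare __cIndex and
// resolve __closureEnvironments through the speculative context.
function rewriteClosure(f, specContext) {
  var body = "var __cIndex = " + f.__cIndex + "; " +
             "with (specContext) { return (" + f.toString() + "); }";
  return new Function("specContext", body)(specContext);
}

var counter = genClosure(0);
counter();                                   // real record: x becomes 1

var specContext = {
  // Clone each record so the speculation gets a private snapshot.
  __closureEnvironments: __closureEnvironments.map(function (rec) {
    return Object.assign({}, rec);
  })
};
var specCounter = rewriteClosure(counter, specContext);
var specValue = specCounter();               // reads and bumps the clone
```

The speculative call advances only the cloned record; the real closure state still holds the value it had when the snapshot was taken.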

Currently, applications must explicitly invoke Crom to 
rewrite functions which return closures. They do this 
by executing g = Crom.rewriteCG(g) for each 
closure-generating function g. Future versions of Crom 
will use lexical analysis to perform this rewriting auto- 
matically. 


5.2 Committing Speculative State 


Committing a speculative context is straightforward. 
First, Crom updates the DOM tree root in the non- 
speculative context, making it point to the DOM tree 
in the committing speculation’s hidden iframe. Next, 
Crom updates the heap state. A for-in loop enu- 
merates the heap roots in specContext and assigns 
them to the corresponding properties in the real global 
name space; this moves both updated and newly created 
roots into place. Finally, Crom iterates through the list 
of global properties deleted by the speculative execution 
and removes them from the real global name space. 
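
The heap portion of the commit can be sketched as below. The helper name commitHeap and its arguments are hypothetical: realGlobal stands in for the real global object, and deletedNames for the list collected by rewritten delete statements. The DOM root swap is omitted.

```javascript
// Hypothetical commit helper for heap state only.
function commitHeap(realGlobal, specContext, deletedNames) {
  // Move updated and newly created heap roots into place.
  for (var name in specContext) {
    if (Object.prototype.hasOwnProperty.call(specContext, name)) {
      realGlobal[name] = specContext[name];
    }
  }
  // Remove globals that the speculation deleted.
  deletedNames.forEach(function (name) {
    delete realGlobal[name];
  });
}

var realGlobal = { inbox: ["a"], draft: "old", scratch: 1 };
var specContext = { inbox: ["a", "b"], draft: "new", selected: 1 };
commitHeap(realGlobal, specContext, ["scratch"]);
```

After the call, realGlobal holds the updated inbox and draft, gains the new root selected, and no longer has scratch.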


5.3 AJAX Caching 


To support the caching of AJAX results, 
client-side programs call the Crom methods 
Crom.cacheAdd(key, AJAXresult) and 
Crom.cacheGet(key). These methods allow 
applications to associate AJAX results with arbitrary 
cache identifiers. Like the regular browser cache, 
Crom’s AJAX cache is accessible to speculative and 
non-speculative code. For example, in the modified 
Decimail client, the event handler for the “fetch new 
message” operation is broken into two functions, 
downloadMessage() and markMessageRead(). 
downloadMessage() is read-only on the server 
side, but it modifies client-side state, e.g., a JavaScript 
array that contains metadata for each fetched message. 
Thus, when the Decimail client speculates on a message 
fetch, it rewrites downloadMessage()’s call tree 
and runs it in a speculative context. The speculative 
downloadMessage() looks for the message in 
Crom’s cache and does not find it. It fetches the new 
email using AJAX and inserts it into Crom’s cache. 
Later, when the user actually triggers the “fetch new 
message” handler, downloadMessage() runs in 
non-speculative mode and finds the requested email in 
the cache. Decimail then calls markMessageRead() 
to inform the server of the user’s action. 
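
A minimal sketch of this cache-then-fetch pattern follows. Only the method names cacheAdd and cacheGet come from the text; the Map-backed Crom object, the key format, and the synchronous fetchFromServer stub are assumptions made so the example is self-contained.

```javascript
// Stand-in for Crom's AJAX cache, backed by a Map.
var Crom = (function () {
  var cache = new Map();
  return {
    cacheAdd: function (key, ajaxResult) { cache.set(key, ajaxResult); },
    cacheGet: function (key) { return cache.get(key); }
  };
})();

var serverFetches = 0;

// Hypothetical synchronous stand-in for the AJAX round trip.
function fetchFromServer(messageId) {
  serverFetches++;
  return { id: messageId, body: "message " + messageId };
}

// Read-only on the server side, so it is safe to run speculatively.
function downloadMessage(messageId) {
  var key = "msg:" + messageId;
  var message = Crom.cacheGet(key);
  if (message === undefined) {
    message = fetchFromServer(messageId);
    Crom.cacheAdd(key, message);
  }
  return message;
}

var speculative = downloadMessage(7);   // speculation: miss, fetch, insert
var real = downloadMessage(7);          // user click: served from the cache
```

The second call returns the cached object without another server fetch, which is the latency the speculation hides.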


5.4 Speculative Uploads 


An upload form typically consists of a file input text 
box and an enclosing submit form. After the user types 
a file name into the text box, the input element generates 
an onchange event. However, the file is not uploaded 
to the server until the user triggers the onsubmit event 
of the enclosing form, typically by clicking a button in- 
side the form. At this point, the application’s onsubmit 
handler is called to validate the file name. Unless this 
handler returns false, the browser POSTs the form to 
the server and sends the server’s HTTP response to the 
target of the form. By default, the target is the current 
window; this causes the browser to overwrite the current 
page with the server’s response. 

Crom implements speculative uploads with a client/ 
server protocol. On the client, the developer specifies 
which file input should be made speculative by call- 
ing Crom.prePost(fileInput, callback). In- 
side prePost(), Crom saves a reference to any user- 
specified onsubmit form-validation handler, since 
Crom will supply its own onsubmit handler shortly. 
Crom installs an onchange event handler for the file 
input which will be called when the user selects a file to 
upload. The handler creates a cloned version of the up- 
load form in a new invisible iframe, with all file inputs 
removed except the one representing the file to specula- 
tively upload. If the application’s original onsubmit 
validator succeeds, Crom’s onchange handler POSTs 
the speculative form to a server URL that only ac- 
cepts speculative file uploads. Crom’s server component 
caches the uploaded file and its name, and the client com- 
ponent records that the upload succeeded. 

Crom.prePost() also installs an onsubmit han- 
dler that lets Crom introspect the form before a real 
click would POST it. If Crom finds a file that has al- 
ready been cached at the server, Crom replaces the as- 
sociated file input with an ordinary text input having the 
value ALREADY_SENT:filename. Upon receipt, Crom’s 
server component inserts the cached file data before pass- 
ing the form to the application’s server-side component. 
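
The server-side substitution can be sketched as follows. This is an illustration under stated assumptions, not Crom's server component: restoreCachedUploads is a hypothetical name, the form is modeled as a plain object, the upload cache as a Map keyed by file name, and the marker is assumed to be the literal prefix ALREADY_SENT: followed by the file name with no spaces.

```javascript
var PREFIX = "ALREADY_SENT:";

// Hypothetical server-side step: fields maps form field names to values,
// uploadCache maps file names to previously uploaded file data.
function restoreCachedUploads(fields, uploadCache) {
  var restored = {};
  Object.keys(fields).forEach(function (name) {
    var value = fields[name];
    if (typeof value === "string" && value.indexOf(PREFIX) === 0) {
      var fileName = value.slice(PREFIX.length);
      if (!uploadCache.has(fileName)) {
        throw new Error("no cached upload for " + fileName);
      }
      // Re-insert the cached file data before the application sees the form.
      restored[name] = { fileName: fileName, data: uploadCache.get(fileName) };
    } else {
      restored[name] = value;
    }
  });
  return restored;
}

var uploadCache = new Map([["report.pdf", "PDF-BYTES"]]);
var form = restoreCachedUploads(
  { title: "Q3", attachment: "ALREADY_SENT:report.pdf" },
  uploadCache
);
```

Ordinary fields pass through unchanged, while the placeholder text input is replaced by the file data cached during the speculative upload.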

The interface given above is least invasive to the ap- 
plication, but a speculation-aware application can pro- 
vide upload progress feedback to the user by registering 
a progress callback with Crom. Crom invokes this han- 
dler in the real domain when the speculative upload com- 
pletes, allowing the application to update its GUI. 


5.5 Optimizations 


To conclude this section, we describe three techniques 
for reducing speculative cloning overheads. 


5.5.1 Lazy Cloning 


For complex web sites, eager cloning of the entire ap- 
plication heap may be unacceptably slow. Thus, Crom 
offers a lazy cloning mode in which objects are only 
copied when a speculative execution is about to access 
them. Since this set of objects is typically much smaller 
than the set of all heap objects, lazy cloning can produce 
significant savings. 

In lazy mode, Crom initially copies only the DOM 
tree and the heap variables referenced by DOM nodes. 
As the speculative computation proceeds, Crom dynam- 
ically rewrites functions as before. However, object 
cloning is now performed as a side effect of the rewriting 
process. Crom’s lexical analysis identifies which vari- 
able names refer to locals, globals, and function param- 
eters. Locals do not need cloning. Strictly speaking, 
Crom only needs to clone globals that are written by a 
speculative execution; reads can be satisfied by the non- 
speculative objects. However, a global that is read at one 
point in a call chain may be passed between functions as 
a parameter and later written. To avoid the bookkeeping 
needed to track these flows, Crom sacrifices performance 
for implementation simplicity and clones a global vari- 
able whenever a function reads or writes it. The function 
is then rewritten using the techniques already described. 
Function parameters need not be cloned as such—if they 
represent globals, they will be cloned by the ancestor in 
the call chain that first referenced them. 

Lazy cloning may introduce problems at commit time. 
Suppose that global objects named X and Y have prop- 
erties X.P and Y.P that refer to the same underlying 
object obj. If a speculative call chain writes to X.P, 
Crom will deep-copy X, cloning obj via X.P. The spec- 
ulation will write to the new clone obj’. If the call 
chain never accesses Y, Y will not be cloned since it is 
not reachable from the object tree rooted at X. Later, if 
the speculation commits, Crom will set the real global X 
to specContext.X, ensuring that the real X.P points 
to obj’. However, Y.P will refer to the original (and 
now stale) obj. 
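
The stale-reference scenario can be reproduced in a few lines. The deepCopy helper is a hypothetical stand-in for Crom's cloning and handles only plain, acyclic object trees; realGlobal models the real global namespace.

```javascript
// Hypothetical deep copy, for plain acyclic object trees only.
function deepCopy(value) {
  if (value === null || typeof value !== "object") return value;
  var copy = Array.isArray(value) ? [] : {};
  Object.keys(value).forEach(function (k) { copy[k] = deepCopy(value[k]); });
  return copy;
}

var obj = { count: 0 };
var realGlobal = { X: { P: obj }, Y: { P: obj } };   // X.P and Y.P alias obj

// The speculation touches only X, so only X's tree is cloned.
var specContext = { X: deepCopy(realGlobal.X) };
specContext.X.P.count = 1;                           // write lands on the clone

// Commit: the real X now points at the speculative tree.
realGlobal.X = specContext.X;

var aliasBroken = realGlobal.X.P !== realGlobal.Y.P; // Y.P is stale
```

After the commit, X.P sees the speculative write but Y.P still refers to the original object, so the two properties no longer alias.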

In practice, we have found that such stale references 
arise infrequently in well-designed, modular code. For 
example, the autocompletion widget is a stand-alone 
piece of JavaScript code. When a developer inserts a 
speculative version of it into an enclosing web page, 
the widget will not be referenced by other heap vari- 
ables, and running in lazy mode as described will not 
cause stale child references. Regardless, to guarantee 
correctness at commit time, Crom provides a checked 
lazy mode. Before Crom issues any speculative computa- 
tions, it traverses every JavaScript object reachable from 
the heap roots or the DOM tree, and annotates each ob- 
ject with _parents, a list of parent pointers. A parent 
pointer identifies the parent object and the property name 
by which the parent references the child. Crom copies 
_parents by reference when cloning an object. When 
a lazy speculation commits, Crom uses the _parents 
list to update stale child references in the original heap. 
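
A parent map of this kind might be built and used as in the sketch below. The helper names annotateParents, cloneTree, and patchStaleReferences are hypothetical, the code handles only plain objects, and storing _parents as a non-enumerable property is an implementation choice of this sketch rather than something the text specifies.

```javascript
// Annotate every reachable object with a list of parent pointers.
function annotateParents(root, seen) {
  seen = seen || new Set();
  if (seen.has(root)) return;
  seen.add(root);
  Object.keys(root).forEach(function (prop) {
    var child = root[prop];
    if (child === null || typeof child !== "object") return;
    if (!child._parents) {
      Object.defineProperty(child, "_parents", { value: [], writable: true });
    }
    child._parents.push({ parent: root, prop: prop });
    annotateParents(child, seen);
  });
}

// Deep-copy a tree, recording original -> clone and sharing _parents
// by reference.
function cloneTree(orig, cloneMap) {
  if (orig === null || typeof orig !== "object") return orig;
  if (cloneMap.has(orig)) return cloneMap.get(orig);
  var copy = {};
  cloneMap.set(orig, copy);
  Object.defineProperty(copy, "_parents", { value: orig._parents, writable: true });
  Object.keys(orig).forEach(function (k) { copy[k] = cloneTree(orig[k], cloneMap); });
  return copy;
}

// At commit, redirect every parent that still references a stale original.
function patchStaleReferences(cloneMap) {
  cloneMap.forEach(function (copy, orig) {
    (orig._parents || []).forEach(function (link) {
      if (link.parent[link.prop] === orig) link.parent[link.prop] = copy;
    });
  });
}

var obj = { count: 0 };
var heap = { X: { P: obj }, Y: { P: obj } };
annotateParents(heap);

var cloneMap = new Map();
var specX = cloneTree(heap.X, cloneMap);
specX.P.count = 1;            // speculative write goes to the clone of obj

heap.X = specX;               // commit the heap root
patchStaleReferences(cloneMap);
```

Because obj recorded both X and Y as parents, the patch step redirects Y.P to the committed clone, and X.P and Y.P alias the same object again.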

Crom also has an unchecked lazy mode in which Crom 
clones lazily but assumes that stale child references never 
occur, thereby avoiding the construction and the check- 
ing of the parent map. For applications with an extremely 
large number of objects and/or a highly convoluted ob- 
ject graph, unchecked lazy mode may be the only cloning 
technique that provides adequate performance. We re- 
turn to this issue in the evaluation section. For now, 
we make three observations. First, our experience has 
been that the majority of event handlers will run safely 
in unchecked mode without modification. Second, Crom 
provides an interactive execution mode that allows devel- 
opers to explicitly verify whether their speculative event 
handlers are safe to run in unchecked lazy mode. Dur- 
ing speculative commits in interactive mode, Crom re- 
ports which committing objects have parents that were 
not lazily cloned and thus would point to stale children 
post-commit. Crom automatically determines the object 
tree roots that the programmer must explicitly reference 
in the event handler to ensure that the appropriate ob- 
jects are cloned. The programmer can then perform this 
simple refactoring to make the handler safe to run in 
unchecked lazy mode. 

Third, and most importantly, Section 6 shows that 
most of Crom’s benefits arise from speculatively warm- 
ing the browser cache. Committing speculative DOM 
nodes and heap objects can mask some computational 
latency, but network fetch penalties are often much 
worse. Thus, if speculating in checked lazy mode is 
too slow, or checked mode refactoring is too painful, an 
application can pass a flag (not shown in Figure 1) to 
Crom.makeSpeculative() which instructs Crom 
to discard speculative contexts after the associated ex- 
ecutions have terminated. In this manner, applications 
can use unchecked lazy speculations solely to warm the 
browser cache, forgoing complications due to commit is- 
sues, but deriving most of the speculation benefit. 


5.5.2 Speculation Zones 


An event handler typically modifies a small frac- 
tion of the total application heap. Similarly, it of- 
ten touches a small fraction of the total DOM tree. 
Lazy cloning exploits the first observation, and spec- 
ulation zones exploit the second. An application 
may provide an optional DOMsubtree parameter to 
Crom.makeSpeculative() that specifies the root 
of the DOM subtree that an event handler modifies. At 
speculation time, Crom will only clone the DOM nodes 
associated with this branch of the tree. At commit time, 
Crom will splice in the speculative DOM branch but 
leave the rest of the DOM tree undisturbed. 

Speculation zones are useful when an event handler 
is tightly bound to a particular part of the visual dis- 
play. For example, the autocompletion widget in Fig- 
ure 3 only modifies the <div> tag that will hold the 
fetched search results. Speculation zones are also useful 
when the DOM tree contains rich objects whose inter- 
nal state is opaque to the native cloneNode () method. 
Examples of such objects are Flash movies and Java ap- 
plets. When cloneNode () is invoked upon such an 
object, the returned clone is “reset” to the initial state of 
the source object; for example, a Flash movie is rewound 
to its first frame. Since it is difficult for JavaScript code 
to reason about the state of such objects, applications 
can use speculation zones to “speculate around” these 
opaque parts of the DOM tree. 


5.5.3 Context Pools 


Crom hides speculative computations within the 
“think time” of a user. The shorter the think time, the 
faster Crom must launch speculations to reduce user- 
perceived fetch latencies. For example, time pressure 
is comparatively high for the autocompletion widget 
(§4.1.1) since a fast typist may generate her search string 
quickly. 

To reduce the synchronous overhead of is- 
suing new speculations, applications can call 
Crom.createContextPool(N, stateSketch, 
DOMsubtree). This method generates and caches N 
clones of the current browser environment, tagging each 
with the sketch value generated when stateSketch 
is passed the current browser context. 

Using context pools, the entire cost of eager cloning 
or the initial cost of lazy cloning is paid in advance. At 
speculation time, Crom finds an unused context that is 
tagged with the application’s current sketch. Crom im- 
mediately specializes it using a mutator and issues the 
new speculation. 
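
The pool logic can be sketched as follows. This is an assumption-laden illustration: the function signatures differ from Crom.createContextPool() (the DOMsubtree argument is omitted and a cloneContext callback stands in for Crom's cloning machinery), and acquireContext is a hypothetical name for the lookup that happens at speculation time.

```javascript
// Hypothetical pool: pay the cloning cost up front and tag each clone
// with the sketch of the context it was copied from.
function createContextPool(n, stateSketch, currentContext, cloneContext) {
  var tag = stateSketch(currentContext);
  var pool = [];
  for (var i = 0; i < n; i++) {
    pool.push({ sketchTag: tag, context: cloneContext(currentContext), used: false });
  }
  return pool;
}

// At speculation time: find an unused clone tagged with the current
// sketch, then specialize it with the mutator.
function acquireContext(pool, currentSketch, mutator) {
  for (var i = 0; i < pool.length; i++) {
    var entry = pool[i];
    if (!entry.used && entry.sketchTag === currentSketch) {
      entry.used = true;
      mutator(entry.context);
      return entry.context;
    }
  }
  return null;   // pool exhausted or stale; caller must clone synchronously
}

var sketch = function (ctx) { return ctx.query.trim(); };
var clone = function (ctx) { return { query: ctx.query }; };
var real = { query: "bl" };

var pool = createContextPool(2, sketch, real, clone);
var first = acquireContext(pool, sketch(real), function (c) { c.query += "ue"; });
var second = acquireContext(pool, sketch(real), function (c) { c.query += "ack"; });
var third = acquireContext(pool, sketch(real), function (c) {});
```

With a pool of two, the first two speculations receive pre-built contexts and the third request finds the pool empty.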


5.6 Limitations and Future Work 


Many of the limitations of the current Crom proto- 
type arise from the fact that the client-side portion runs 
as slow JavaScript code instead of fast C++ code inside 
the browser. For example, we pursued the optimiza- 
tions described in Section 5.5 after discovering that in 
many cases, copying the entire browser context using 
JavaScript code would result in unacceptably slow per- 
formance. Ideally, browsers would natively support con- 
text cloning, and applications could always use checked 
mode speculations without fear of excessive CPU usage. 

Speculative fetches compete with non-speculative 
fetches for bandwidth. A native implementation of 
Crom could measure the traffic across multiple flows 
and prioritize non-speculative fetches over speculative 
ones. Our JavaScript implementation cannot mea- 
sure such browser-wide network statistics. However, 
developers can use Crom.maxSpeculations(N) 
to place an upper limit on the number of spec- 
ulations that can be triggered by a single call to 
Crom.forceSpeculations(). 

As mentioned in Section 5.5.2, opaque browser ob- 
jects like applets and Flash applications cannot be intro- 
spected by JavaScript code. Our JavaScript Crom library 
can only clone them in a crude fashion by recreating their 
HTML tags (and thereby reinitializing the clones to a vir- 
gin state). In practice, we expect most applications to 
avoid this issue by using speculation zones. However, 
some applications might benefit from the ability to clone 
rich objects in more sophisticated ways. An in-browser 
implementation of Crom could do this, but we leave an 
exploration of these mechanisms for future work. 

The parsing engine for our client-side rewriter is fairly 
unsophisticated. Implemented with regular expressions, 
it is sufficient for analyzing the real applications de- 
scribed in the next section. However, it does not parse 
the complete JavaScript grammar. Future versions of the 
client-side library will use an ANTLR-driven parser [18]. 
An in-browser implementation could obviously reuse the 
browser’s native parsing infrastructure. 


6 Evaluation 


Ideally, we would evaluate Crom by taking a preexist- 
ing application that has custom speculation code, replac- 
ing that code with Crom calls, and comparing the perfor- 
mance of the two versions. Unfortunately, custom specu- 
lation code is often tightly integrated with the rest of the 
application’s code base, making it difficult to remove the 
code in a principled way and provide a fair comparison 
with Crom’s speculation API. Thus, our evaluation ex- 
plores the performance of the real applications from Sec- 
tion 4 that we modified to use the Crom API. We show 
that Crom’s computational overheads are hideable within 
user think time, and that Crom can reduce user-perceived 
fetch latencies by an order of magnitude under realistic 
network conditions. 

Our Crom prototype has been tested most extensively 
on the Firefox 3.5 browser. Its core functionality has also 
been tested on IE7 and Safari 4.0. However, we are still 
fixing compatibility issues with the latter set of browsers, 
so this section only contains Firefox results. 


6.1 Performance of Modified Applications 


To test the speculative autocompletion widget and tab 
manager, we downloaded real web pages onto our lo- 
cal web server; Figure 7 lists the pages that we exam- 
ined. We inserted the Crom library and the speculative 
applications into the pages, then loaded the pages using 
a browser to test their performance. Decimail is a stand- 
alone application, so we tested it by itself, i.e., we did not 
embed it within an enclosing web page. Stripped of com- 
ments and extraneous whitespace, Crom’s JavaScript 
code added 65 KB to each application’s download size. 
All experiments ran on an HP xw4600 workstation with 
a dual-core 3 GHz CPU and 4 GB of RAM. Web con- 
tent was fetched from a custom localhost web server that 
introduced tunable fetch delays. This allowed us to mea- 
sure Crom’s latency-hiding benefits as a function of the 
fetch penalty. All experiments represent the average of 
10 trials. Standard deviations were less than 6% in all 
cases. 


Web site | Primitives | Objects | Functions | DOM nodes 
(rows: Google, Gmail-o, Gmail-i, ESPN, YouTube, Live, 
Yahoo, MySpace, eBay, MSN, Amazon, CNN) 

Figure 7: The types of JavaScript variables in several 
popular web pages. Gmail-o refers to Gmail’s outer con- 
trol frame and Gmail-i refers to the inner frame which 
displays the inbox message list. 


Figure 8: Latency reductions using Crom’s AJAX cache. 


6.1.1 Decimail Client 


The modified Decimail client [5] used Crom’s AJAX 
cache to store results from speculative mail fetches. Run- 
ning in unchecked lazy mode, Decimail’s cloning and 
rewriting overheads were less than 5 ms per fetch, so we 
elide further discussion of them. Figure 8 shows the ben- 
efit of finding a requested message in the Crom cache in- 
stead of having to fetch it synchronously. Unsurprisingly, 
a cache hit took no more than 3 ms to serve, whereas the 
cache miss penalty was the fetch latency to the server. 


Figure 9: Tab manager pre- and post-commit overheads 
(checked mode costs in grey). 


Figure 10: User-perceived latencies for tab manager. 


6.1.2 Tab Manager 


Figure 9 depicts the speculative overheads in the tab 
manager application [12]. In this experiment, the tab 
manager was embedded within the ESPN front page. 
Clicking the “make new tab” button generated an AJAX 
request for a web page. Once fetched, the page was ren- 
dered inside a <div> tag; this tag was a child of the en- 
closing <div> tree controlled by the tab manager. Crom 
used this enclosing tag as the speculation zone. We con- 
figured the tab manager to speculatively fetch the CNN 
home page, a complex page with over 1700 DOM nodes. 

Figure 9 breaks the speculation overheads into pre- 
commit costs (the left side of the dotted line) and during- 
commit costs (the right side of the line). The black bars 
represent the overheads for unchecked lazy mode. The 
grey bars represent the additional costs that would arise 
from running lazy cloning in checked mode. 

Unchecked lazy mode: When speculating in 
unchecked lazy mode, aggregate pre-commit spec- 
ulation costs were low, totalling 24 ms. Copying the 
speculated-upon DOM tree was fast (1 ms), since the 
speculation zone consisted of a small <div> tree of 
depth 3. Walking the cloned DOM tree and copying 
event handlers and user-defined objects took only 7 ms. 
There was no mutator to rewrite, but Crom did have to 
rewrite an event handler call tree with a maximum depth 
of 4. Figure 9 breaks the rewriting cost into two parts. 
The first part represents the cost of lexical analysis, 
source code modification, and calling eval () on the 
modified source. The second part represents the cost 
of copying heap objects during the rewriting process. 
Both costs were less than 10 ms each since the call chain 
touched a minority of the total page state. 

At commit time, there were no stale child references 
to fix since Crom was running in unchecked mode. Com- 
mitting speculative heap objects took less than 1 ms, 
since Crom merely had to make global variables in the 
real domain point to speculative object trees. Over 99% 
of the commit overhead was generated by the splice of 
the speculative DOM subtree into the non-speculative 
one. Although the splice was handled by native code, 
it required a screen redraw, since the formerly invisible 
CNN context was now visible. Screen redraws are one 
of the most computationally intensive browser activities. 
However, Crom avoided the additional (and more expen- 
sive) reflow cost, since the layout for the CNN content 
was determined during the speculative execution. Since 
even non-speculative computations must pay the redraw 
cost upon updating the screen, Crom added less than 1 
ms to the inherent cost of displaying the new content. 

Checked lazy mode: In this mode, Crom built a par- 
ent map before issuing any speculations and checked for 
stale object references at commit time. The grey bars in 
Figure 9 depict these costs, which must be paid in ad- 
dition to the rewriting and cloning costs in black. Con- 
structing the parent mapping took 158 ms, leading to an 
overall pre-commit cost of 182 ms. Although the map- 
ping cost is amortizable across multiple speculations, 
an aggregate CPU overhead of 182 ms pushes against 
the limits of acceptability. The current implementation 
of Crom does not stage its pre-commit operations, i.e., 
Crom holds the processor for the entire duration of the 
cloning, rewriting, and parent mapping process. De- 
pending on the application, a pre-commit overhead of 
182 ms may interfere with foreground, non-speculative 
JavaScript that needs CPU cycles. Recent versions of 
Firefox provide a JavaScript yield statement for im- 
plementing generators; future versions of Crom may be 
able to stage pre-commit costs using such generators. 

When committing a checked-mode speculation, Crom 
must patch stale object references. In this experiment, 
the cost was only 5 ms. As explained in Section 6.2.4, 
the patching overhead is proportional to the number of 
objects cloned, which is typically much smaller than the 
total number of objects in the application. 

User-perceived benefits: To keep the GUI responsive, 
long redraws in Firefox are split into a synchronous part 
and several asynchronous ones. The commit-time black 
bar in Figure 9 only captures the synchronous cost. Fig- 
ure 10 quantifies the end-to-end user-perceived latency 
reduction that is enabled by speculative prefetching and 
pre-layout. The injected web server latency varies on 
the x-axis, and the y-axis provides the delay between 
the start of the page load and the final screen repaint; 
repaint activity was detected using the custom Firefox 
event MozAfterPaint. 


Figure 11: Autocompleter pre- and post-commit costs 
(checked-mode costs in grey). 


Figure 12: User-perceived latencies for autocompleter. 


Figure 10 shows that speculative execution can dra- 
matically reduce user-perceived latencies. For exam- 
ple, with a 300 ms fetch penalty, Crom needed 399 ms 
to complete the page load, whereas a non-speculative 
synchronous load with a cold cache required 3,427 ms. 
Speculatively prefetching the data but discarding the lay- 
out (i.e., not committing the speculative context) re- 
quired 1,302 ms of user-perceived load time. The load 
penalty relative to the full speculative case arose from 
two sources: recalculating the layout, and for some ob- 
jects, waiting for the server to respond to cache validation 
messages. Such messages are not generated when a pre- 
computed DOM subtree is inserted into the foreground 
DOM tree. 
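
For reference, the relative improvements implied by the figures quoted above can be computed directly. The three latencies are taken from the text; the variable and function names are ours.

```javascript
// Latencies reported in the text for a 300 ms injected fetch penalty.
var fullSpeculationMs = 399;     // prefetch plus committed pre-layout
var prefetchOnlyMs = 1302;       // prefetch, speculative layout discarded
var coldCacheMs = 3427;          // non-speculative synchronous load

function speedup(baselineMs, improvedMs) {
  return baselineMs / improvedMs;
}

var vsColdCache = speedup(coldCacheMs, fullSpeculationMs);       // about 8.6
var vsPrefetchOnly = speedup(prefetchOnlyMs, fullSpeculationMs); // about 3.3
var layoutAndValidationMs = prefetchOnlyMs - fullSpeculationMs;  // 903
```

At this fetch penalty, full speculation is roughly 8.6 times faster than a cold-cache load, and the 903 ms gap to the prefetch-only case is the cost attributed to layout recalculation and cache validation.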


6.1.3 Autocompletion Widget 


To evaluate the speculation overheads for the auto- 
completer [21], we embedded it inside the Amazon front 
page, simulating a new search box for the page. When 
the user hit the widget’s “submit” button, the widget used 
AJAX to fetch and display content associated with the 
user’s search terms. The autocompleter speculated after 
the user had typed two letters, and we limited the number 
of speculations to three. To provide a comparison with 
the tab manager experiment, the widget always fetched 
the CNN front page. Crom ran in lazy mode with a spec- 
ulation zone consisting of the <div> tag that would dis- 
play the fetched content. 

Figure 11 depicts the speculation overheads. Examin- 
ing the overheads from left to right, we see that copying 
the DOM tree, the event handlers, and the user objects 
was extremely fast. This is because the autocompleter 
used a speculation zone that limited the DOM copying 
to a single div subtree. The mutator function was also 
very simple, so rewriting it was cheap. However, rewrit- 
ing the event handler call tree was not cheap due to high 
cloning costs. The JavaScript objects that implemented 
the autocompletion logic were fairly complex, and copy- 
ing the whole set took 33 ms. Since the autocompleter 
issued three speculations, this object set had to be copied 
three times. 

In unchecked lazy mode, the total pre-commit specu- 
lation cost was 114 ms, not counting the 22 ms needed to 
preemptively create three speculative domains (see Sec- 
tion 5.5.3 for details). Creating a parent mapping would 
add 379 ms, making the total pre-commit overhead a 
prohibitive 493 ms. Thus, the autocompleter requires 
unchecked lazy mode to make Crom’s speculations fea- 
sible. 

Commit costs were similar for the tab manager and the 
autocompletion widget. As Figure 12 shows, the reduc- 
tion in user-perceived latency was equally dramatic. 


6.2 Exploring Speculation Costs 


The previous section showed that Crom’s overheads 
are low when using unchecked lazy cloning and specula- 
tion zones. In this section, we examine the costs of full 
heap and checked lazy speculation, as well as the cost 
of copying the entire DOM tree. These types of spec- 
ulation require the least effort from the web developer, 
since there is no need to identify speculable DOM sub- 
trees or perform checked-mode refactoring. We believe 
that these activities are fairly straightforward for well- 
designed code. Unfortunately, poorly written JavaScript 
code abounds, and even good developers may generate 
tangled code when working on a constantly evolving web 
page. Thus, it is important to understand when checked 
mode or full copy speculation is feasible when the Crom 
API is not natively implemented by the browser. 


6.2.1 Copying the Full Application Heap 


Figure 7 describes the JavaScript environment for sev- 
eral popular web sites². As expected, there is wide vari- 
ation across sites. Figure 13 shows the time needed by 
Firefox to copy the full application heap in these sites. 
For many sites, the one-time copy overhead was low. For 
example, in a fairly complex site like MSN, the copy cost 
was 94 ms. However, this is a per-speculation cost, since 
each speculative execution needs a private copy of the 
heap. The larger the cost, the fewer speculations can be 
issued in a window of a few hundred milliseconds. 

² Figure 7 provides a lower bound on the number of variables in each 
site; Crom cannot discover objects which are only reachable by object 
trees rooted in DOM 2 handler state. 


Figure 13: Copying the full application heap. 


Figure 14: Copying the full DOM tree. 

For more complex sites like Amazon, YouTube, and 
MySpace, full heap copying was prohibitively expen- 
sive, mainly due to the overhead of function cloning. 
This overhead was proportional to both the number 
and the complexity of the functions. So, even though 
MySpace and YouTube had a similar number of 
functions (904 versus 894), it took 452 ms to clone 
MySpace’s heap, but 11 seconds to clone YouTube’s 
heap. The high cloning overhead for YouTube is a 
consequence of its sophisticated JavaScript functional- 
ity: it contains over 200K of code for manipulating Flash 
movies, performing click analytics, and managing adver- 
tisements. For sites like this, full heap copying is com- 
pletely infeasible, making lazy cloning a prerequisite for 
high-performance speculation. 
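
The relationship between per-speculation copy cost and the number of speculations that fit in an interaction window can be made concrete with a little arithmetic. The sketch below is illustrative only: the 300 ms window is an assumed value standing in for "a few hundred milliseconds", the function name is hypothetical, and all work other than the heap copy is ignored.

```python
def speculations_per_window(copy_cost_ms: float, window_ms: float = 300.0) -> int:
    """Number of full-heap copies that fit in the window, ignoring all other work."""
    if copy_cost_ms <= 0:
        raise ValueError("copy cost must be positive")
    return int(window_ms // copy_cost_ms)


# Copy costs reported in the text: MSN 94 ms, MySpace 452 ms, YouTube 11 s.
# The 300 ms default window is an assumption, not a measured value.
for site, cost_ms in [("MSN", 94), ("MySpace", 452), ("YouTube", 11_000)]:
    print(site, speculations_per_window(cost_ms))
```

Under this assumed window, only the MSN-sized heap leaves room for more than one speculation, which is consistent with the argument for lazy cloning on larger sites.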


6.2.2 Copying the Full DOM Tree 


To copy the DOM tree (or a branch of the tree), Crom
first calls cloneNode(true) on the root of the tree.
Then, Crom traverses the cloned tree and the base tree in
parallel, reference-copying the event handlers and deep-copying
the user-defined objects belonging to the non-speculative
DOM nodes. Figure 14 shows the relative
costs of these two steps. Since cloneNode() is implemented
in native code, the first step was very fast. The
second step was more onerous since it was implemented
by non-native Crom code. The aggregate copy cost represents
a per-speculation overhead, since each speculative
computation must possess a private copy of the tree.
Amazon had the highest copy overhead (247 ms), but
40% of the remaining websites had copy penalties above
100 ms. Thus, we expect that most speculative web sites
will avoid full DOM copying and use speculation zones.

Figure 15: Creating a parent map.
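
The two-step copy described in this subsection can be sketched as follows. This is a minimal illustration in Python rather than Crom's JavaScript: the Node class, clone_structure (a stand-in for the native cloneNode(true)), and copy_extras are hypothetical names introduced here, not part of Crom.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    handlers: dict = field(default_factory=dict)   # event name -> callable
    user_data: dict = field(default_factory=dict)  # application-defined objects


def clone_structure(node):
    """Step 1: stand-in for the native cloneNode(true); copies only the tree shape."""
    return Node(node.tag, [clone_structure(c) for c in node.children])


def copy_extras(base, clone):
    """Step 2: walk both trees in parallel, sharing handlers and deep-copying user objects."""
    stack = [(base, clone)]
    while stack:
        b, c = stack.pop()
        c.handlers = dict(b.handlers)              # same function objects (reference copy)
        c.user_data = copy.deepcopy(b.user_data)   # private copy for the speculation
        stack.extend(zip(b.children, c.children))


def copy_dom(root):
    clone = clone_structure(root)
    copy_extras(root, clone)
    return clone
```

In this sketch the second step does the per-node work that the text attributes to non-native code, so its cost grows with the number of nodes that carry handlers or user-defined objects.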


6.2.3 Creating the Parent Map 


Figure 15 shows the cost of creating a parent map 
in various web sites. The overhead is split into two 
parts: the cost of traversing the DOM tree, and the 
cost of walking the object forest rooted in the applica- 
tion heap. Traversing the DOM tree was much faster 
than traversing the application heap. Whereas Crom
could get a list of all DOM nodes using the native code
document.getElementsByTagName("*"), it had
to walk the application heap using a user-level breadth-first
traversal.

Creating a parent map and updating stale references is 
unnecessary if Crom copies the entire application heap. 
However, given the choice between unchecked execution 
with full heap copying, and checked execution with lazy 
copying, many applications will prefer the latter. This 
is because the overhead of parent mapping is amortiz- 
able across multiple speculations, whereas full heap copy 
costs cannot be shared. Comparing Figures 13 and 15 re- 
veals that for sites with small enough heaps to make full 
heap copying reasonable, creating the parent map is of- 
ten no more expensive than a single full heap copy; thus, 
when issuing multiple speculations, lazy copying quickly 
repays the initial parent mapping overhead. 
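
A parent map of the kind discussed here can be built with a breadth-first walk that records, for every reachable object, which objects refer to it and under which property. The sketch below is an assumption-laden illustration: it models heap objects as Python dicts, and build_parent_map is a hypothetical name, not Crom's implementation.

```python
from collections import deque


def build_parent_map(roots):
    """Breadth-first walk over dict-based objects.

    Returns {id(child): [(parent, key), ...]} for every dict reachable from roots.
    """
    parents = {}
    seen = set()
    queue = deque()
    for root in roots:
        if id(root) not in seen:
            seen.add(id(root))
            queue.append(root)
    while queue:
        obj = queue.popleft()
        for key, child in obj.items():
            if not isinstance(child, dict):
                continue  # primitives are copied by value, so they need no parent entry
            parents.setdefault(id(child), []).append((obj, key))
            if id(child) not in seen:  # guards against cycles and shared children
                seen.add(id(child))
                queue.append(child)
    return parents
```

Because the map depends only on the non-speculative heap, it can be computed once and reused by later speculations, which is the amortization argument made above.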


6.2.4 Committing a Speculative Domain 


Committing a speculative context requires three ac- 
tions. First, the speculative DOM tree must be spliced 
into the real DOM tree. Second, the cloned object trees 
from the application heap must be inserted into the real 
global name space. Third, if Crom is running in checked 
mode, the non-speculative parents of speculative objects 
must have their child references patched. 
Figure 16: Committing a full DOM tree.


Repairing the heap name space is essentially free—for 
each reference to an object tree root, Crom just makes it 
point to the speculative copy of the root object. The time 
required for parent patching is O(SP), where S is the
number of speculatively cloned objects and P is the average
number of parents per object. S is typically much
smaller than the total number of objects, so in our experience,
fixing stale child references takes no more than
20 ms.
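
The O(SP) bound follows directly from the shape of the patching loop: one pass over the S cloned objects, each touching its P recorded parents. The sketch below assumes a parent map of the form {id(original): [(parent, key), ...]} and dict-based objects; patch_parents is a hypothetical name, and skipping parents that were themselves cloned is an assumption made for this illustration.

```python
def patch_parents(clones, parent_map):
    """Redirect non-speculative parents to the committed speculative copies.

    clones:     list of (original, speculative_copy) pairs, S entries
    parent_map: {id(original): [(parent, key), ...]}, about P entries per object
    Returns the number of references rewritten; the loop does O(S * P) work.
    """
    cloned_ids = {id(orig) for orig, _ in clones}
    patched = 0
    for orig, spec in clones:
        for parent, key in parent_map.get(id(orig), []):
            if id(parent) in cloned_ids:
                continue  # assumption: a cloned parent is replaced wholesale by its own copy
            parent[key] = spec
            patched += 1
    return patched
```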

Splicing the speculative DOM tree into the real one 
is handled by native code. However, the splice may be 
the slowest step of the commit if the speculative tree is 
large. In these situations, the browser is forced to re-render
large regions of the display, a computationally intensive
task. Figure 16 shows the splice costs in the absence
of speculation zones, i.e., when the entire DOM
tree is copied and then reinserted. With the exception
of CNN, each splice took less than 75 ms. Regardless,
DOM splicing is a cost that must be paid by both specu- 
lative and non-speculative code, and even checked com- 
mits add little overhead to the fundamental cost of updat- 
ing the browser display. 


7 Related Work 


File systems often use prefetching to hide disk and 
network latency [13, 19], and many of these techniques 
can be applied to the web domain. Empirical studies 
show that over half of all web objects can be effec- 
tively prefetched, doubling the latency reductions avail- 
able from caching alone [14]. Various academic and 
commercial projects have tried to exploit this opportu- 
nity [4, 15, 20]. For example, Padmanabhan’s algo- 
rithm uses server-collected access statistics to generate 
prefetching hints for clients [17]. The current draft of the 
HTML 5 protocol allows such hints to be specified using 
a new “prefetch” attribute for <link> tags [23]. The 
Fasterfox extension for Firefox automatically prefetches 
statically referenced content in a page [10]. All of these 
approaches are limited to prefetching across static con- 
tent graphs. In contrast, Crom allows prefetching in web 
pages with interactive client-side code and dynamic con- 
tent. 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 141 




Speculative execution has been used to drive prefetch- 
ing in file systems [1, 8] and parallel computation sys- 
tems [2]. In these environments, a process is speculated 
forward, possibly on incorrect data, to discover what I/O 
requests may appear in the future. The sole purpose of 
speculative execution is to warm the cache; other side ef- 
fects are discarded. In contrast, Crom speculates on user 
activity rather than the results of I/O operations. When 
appropriate, Crom can also commit speculative contexts 
to hide computational latencies associated with screen 
redraws. 

The Speculator Linux kernel [16] supports speculative 
execution in distributed file systems. A client-side file 
system process speculates on the results of remote oper- 
ations and continues to execute until it tries to generate 
an externally visible output like a network packet. The 
process is blocked until its speculative input dependen- 
cies are definitively resolved. At that point, the process 
is either marked as non-speculative or rolled back to a 
checkpoint. Unlike Speculator, Crom has a server-side 
component which allows speculations to externalize net- 
work activity. However, Crom requires applications to 
be modified, whereas Speculator only modifies the OS 
kernel. 


8 Conclusion


In this paper, we describe why speculative execu- 
tion is a natural optimization technique for rich web 
applications. We introduce a high-level API through 
which applications can express speculative intent, and a 
JavaScript implementation of this API that runs on un- 
modified browsers. This implementation, called Crom, 
automatically converts event handlers to speculative ver- 
sions and runs them in isolated browser contexts, caching 
both the fetched data and the screen layout for that data. 
Experiments show that Crom-enabled applications can 
reduce user-perceived delays by an order of magnitude, 
greatly improving the browsing experience. By abstract- 
ing away the programmatic details of speculative execu- 
tion, Crom successfully lowers the barrier to producing 
rich, low-latency web applications. 
Abstract


Today, large-scale web services run on complex sys- 
tems, spanning multiple data centers and content dis- 
tribution networks, with performance depending on di- 
verse factors in end systems, networks, and infrastructure 
servers. Web service providers have many options for 
improving service performance, varying greatly in feasi- 
bility, cost and benefit, but have few tools to predict the 
impact of these options. 

A key challenge is to precisely capture web object de- 
pendencies, as these are essential for predicting perfor- 
mance in an accurate and scalable manner. In this pa- 
per, we introduce WebProphet, a system that automates 
performance prediction for web services. WebProphet 
employs a novel technique based on timing perturba- 
tion to extract web object dependencies, and then uses 
these dependencies to predict the performance impact 
of changes to the handling of the objects. We have 
built, deployed, and evaluated the accuracy and effi- 
ciency of WebProphet. Applying WebProphet to the 
Search and Maps services of Google and Yahoo, we find 
WebProphet predicts the median and 95th percentiles of
the page load time distribution with an error rate smaller 
than 16% in most cases. Using Yahoo Maps as an exam- 
ple, we find that WebProphet reduces the problem of per- 
formance optimization to a small number of web objects 
whose optimization would reduce the page load time by 
nearly 40%. 


1 Introduction 


Software vendors and service providers are increasingly 
delivering services to users through the Internet. Large- 
scale web services, such as maps, search, and social net- 
working, have proliferated, attracting hundreds of mil- 
lions of users worldwide. On the client side, these 
services heavily leverage Asynchronous Javascript and 
XML (AJAX) to provide a seamless and consistent user 
experience across devices and form factors. Behind the 


scenes, significant amounts of data and computation are 
provided by servers in the cloud. 

Many web services are extremely complex, since they 
aim to match or even exceed the rich user experience offered
by traditional desktop applications. For instance, the
“driving directions” webpage of Yahoo Maps comprises 
about 110 embedded objects and 670KB of Javascript 
code. These objects are retrieved from many differ- 
ent servers, sometimes even from multiple data centers 
(DCs) and content distribution networks (CDNs). These 
dispersed objects meet only at a client machine, where 
they are assembled by a browser to form a complete web- 
page. Since service providers lack object-level measure- 
ments obtained from clients, it is hard for them to as- 
sess and study user-perceived performance. Moreover, 
there exists a plethora of dependencies between different
objects. Many objects cannot be downloaded until some 
other objects are available. For instance, an image down- 
load may have to await a Javascript download because 
the former is requested by the latter. These multiple fac- 
tors make it highly challenging to understand and predict 
the performance of web services. 

The performance of web services has direct impact 
on user satisfaction. Poor page load times (PLT) result 
in low service usage, which in turn may undermine ser- 
vice income. For instance, a study by Amazon reported 
roughly 1% sales loss as the cost of a 100 ms extra delay.
Another study by Google found that a 500 ms extra delay
in displaying search results may reduce revenues by up to
20% [16]. Even worse, users may simply abandon a ser- 
vice provider for another offering, as switching barriers 
are often low. 

Ideally, service providers would like to predict the 
effects of potential optimizations before actual deploy- 
ment. Yet it is seldom clear what benefits various pos- 
sible options for improvement might bring a service 
provider — whether optimization to the object structure 
of the page, or optimizations in the manner in which 
content is placed and delivered over the Internet. User-perceived
PLT is affected by the loading time of web objects
and their dependencies. The loading time of each
individual object is further affected by a variety of delay 
factors, including DNS lookup time, network round trip 
time (RTT), server response time, and client execution 
time. 

One compelling way to predict performance is to first 
measure the PLT through experiments on the service it- 
self (e.g., A/B tests [16] by varying a given property of 
the service), and then to extrapolate estimates using some 
form of regression. However, such experiments can be 
difficult to set up and expensive to sustain. It is not uncommon
for such experiments to run for days or even
weeks, limiting the capacity for adding additional ex- 
periments. Furthermore, it is extremely challenging to 
sweep the space of all possible scenarios since the num- 
ber of scenarios grows exponentially with the number of 
objects and delay factors. Without detailed knowledge 
of object dependencies, it is difficult to decide how many 
distinct scenarios need to be measured to attain accurate 
predictions. 

Existing approaches for performance prediction gen- 
erally fall into two broad categories: provider based vs. 
end-system based. In the first category, WISE [23] pre- 
dicts performance based on server logs collected at the 
service provider’s data centers. As a result, this approach 
has limited visibility into some client-side factors that 
are crucial for user-perceived PLT, such as page render- 
ing time, object dependencies, and multiple data sources 
(crossing data centers and content providers). In the sec- 
ond category, Link Gradients [10] proposes to predict 
end-to-end response times of untested system configura- 
tions, assuming the effects of change in individual fac- 
tors are completely independent of each other. While 
this assumption may hold in small-scale enterprise ap- 
plications, it is inapplicable to complex web services in 
which inter-component dependencies are prevalent. 

To overcome these challenges and shortcomings, this 
paper presents WebProphet, a tool that predicts the im- 
pact of various optimizations on user-perceived PLT of 
web services. First, WebProphet aims to be applicable to 
a diverse set of web services. Second, WebProphet aims 
to automatically produce accurate predictions. Given 
the number of web services and the churn in their implementations,
a tool that involves manual effort can be
overly burdensome and error prone. 

WebProphet consists of a measurement engine, a de- 
pendency extractor, and a performance predictor. The 
dependency extractor employs a novel algorithm to in- 
fer dependencies between web objects by perturbing the 
download times of individual objects. Our key observation
is that the delay of an individual object will be propagated
to all of its dependent objects. While others have
noticed that timing perturbation can convey information 
(in particular, [20] uses such techniques to transmit data
covertly), we are the first to apply it to systematically 
discovering web object dependencies. Given the depen- 
dency graph of a webpage, the performance predictor im- 
plements a simple and yet accurate method to simulate 
the page load process of a web browser. It can make fast 
and accurate PLT prediction under any combination of 
changes in objects and delay factors. It can also predict 
the statistical properties (e.g., median or 95th-percentile) 
of a PLT distribution under a hypothetical scenario. 

We applied WebProphet to four widely-used web ser- 
vices: Maps and Search of Google and Yahoo. We 
verified that our system successfully extracts the depen- 
dency graphs for all these services, even though some of 
the complex webpages comprise over 100 objects. We 
used WebProphet to predict PLT on real, popular web 
browsers using controlled experiments and the Planet- 
Lab testbed. Our evaluation shows that the predictions of 
WebProphet are highly accurate, with error rates mostly 
under 16%. This is quite promising given the inher- 
ent noise (e.g., different loss conditions) in these ex- 
periments. We then apply WebProphet to finding cost- 
effective optimization strategies for real applications. 
For instance, Yahoo Maps contains 110 objects and has a 
median PLT of 3.987 seconds measured from Northwest- 
ern University. By simply optimizing the client execu- 
tion time of 14 objects and moving 5 static objects from 
Yahoo data centers to the Akamai CDN, the median PLT 
of Yahoo Maps can be cut by nearly 40%. 

We next discuss the problem formulation and
present an overview of WebProphet in §2. We describe
dependency extraction in §3 and performance prediction
in §4. The implementation is covered in §5. In §6 and §7,
we show the results of dependency extraction and performance
prediction respectively. We demonstrate how
WebProphet helps to optimize the PLT of Yahoo Maps
in §8. We evaluate the system’s performance in §9. Finally,
we review the related work in §10 and conclude in
§11.


2 Problem Context 


Many web services are delivered to users in the form of webpages
that can be rendered by a browser. Sophisticated
webpages may contain many static and dynamic objects 
arranged hierarchically. To load a page, a browser typ- 
ically first downloads a main HTML object that defines 
the structure of the page. Next, it may download a Cas- 
cading Style Sheets (CSS) object that describes the pre- 
sentation of the page. The main HTML object may em- 
bed many Javascript objects that are executed locally to 
interact with a user. As the page is being rendered, an 
HTML or a Javascript object may request additional objects,
such as images and Javascripts. This process continues
until all relevant objects are loaded.

Figure 1: The page load time decomposition.


We define page load time (PLT) as the time between 
when the user triggers the page starting to load and when 
all the objects in the page are loaded. Sometimes, users 
do not care about all the objects in a page. For instance, 
a page may contain invisible images, advertisements, or 
user tracking services. Moreover, a user action may only 
trigger a few new objects to be loaded after the initial 
page load. Accordingly, we could also define PLT as 
the time to load a subset of objects in a page that are 
relevant to user-perceived performance. Note that there 
is a subtle difference between when objects are loaded 
and when objects are perceived by the user. While the 
latter is more directly related to user satisfaction, it is 
also harder to define and measure precisely. Therefore, 
we choose to focus on the former in this paper. 


As illustrated in Figure 1, we may decompose the 
loading time of each object into client delay, network de- 
lay, and server delay. The client delay is due to various 
browser activities such as page rendering and Javascript 
execution. The network delay can be further decomposed 
into DNS lookup time, TCP three-way handshake time, 
and data transfer time. TCP handshake time and data 
transfer time are influenced by network path conditions 
such as RTT and packet loss. The server delay is pro- 
duced by various server processing tasks such as retriev- 
ing static content or generating dynamic content. 
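
The decomposition above can be written down as a small record type. The sketch below is illustrative: ObjectTiming and its field names are hypothetical, and it assumes the three components simply add up for a single object, which ignores any overlap between them.

```python
from dataclasses import dataclass


@dataclass
class ObjectTiming:
    """Per-object delay components, in milliseconds (hypothetical field names)."""
    client_ms: float          # page rendering, Javascript execution
    dns_ms: float             # DNS lookup
    tcp_handshake_ms: float   # TCP three-way handshake
    transfer_ms: float        # data transfer
    server_ms: float          # static retrieval or dynamic generation

    @property
    def network_ms(self) -> float:
        return self.dns_ms + self.tcp_handshake_ms + self.transfer_ms

    @property
    def total_ms(self) -> float:
        # Assumption: the components are additive for one object.
        return self.client_ms + self.network_ms + self.server_ms
```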


Service providers have many different options to im- 
prove the PLT of a webpage. For instance, they may 
upgrade the back-end infrastructure to reduce server re- 
sponse time for dynamic objects. They may use a CDN 
service to reduce the network delay for static objects. 
They may also optimize the implementation code to re- 
duce the client execution time for computation-intensive 
objects. While optimizing for an individual object or delay
factor (or for a combination of multiple objects or
delay factors) will bring some benefits, it may also incur
significant costs in development and management. It
is economically infeasible for a service provider to optimize
for every object/delay factor, and it is often unclear
where to find the biggest bang for the buck. Our goal is
to build an automated system that can accurately predict
the PLT improvement under any combination of changes
in objects and delay factors. A service provider can easily
use our system to narrow down the optimization strategies
that could bring the most benefits.

Figure 2: System architecture.

For a web service, WebProphet predicts PLT based on 
a performance model extracted from client-side obser- 
vations. Compared to server-side techniques [23], our 
approach can take into account a few important factors 
that are visible only at the client. First, a modern web- 
page usually contains many objects which have depen- 
dencies between each other. As a result, the PLT cannot 
be estimated simply based on the page size and TCP- 
level characteristics such as RTT, packet loss, and con- 
gestion window size. In fact, the dependencies will deter- 
mine when an object can be loaded and which objects can 
be loaded in parallel. Second, many webpages comprise 
sophisticated HTML and Javascript objects to provide a 
rich user experience. Nonetheless, HTML rendering and 
Javascript execution may introduce significant client de- 
lay. Third, the objects in a page may come from multiple 
data centers and CDN nodes. For example, Yahoo Maps 
uses both the Akamai CDN and Yahoo data centers to 
deliver page content. Though the client side is the ideal 
place where we can measure the user-perceived PLT ac- 
counting for all these effects end-to-end, existing client 
browsers lack the measurement hooks needed. 

As shown in Figure 2, WebProphet has three major 
components. Given a webpage, the dependency extrac- 
tor infers the dependencies between objects by perturb- 
ing the download times of individual objects. The mea- 
surement engine controls multiple automated web agents 
which can drive a full-featured web browser (Firefox 3) 
to load the page. The measurement engine also collects 
one packet trace for each page load. Using the extracted 
dependency graph and the packet trace in a baseline sce- 
nario, the performance predictor estimates the PLT in a 
new scenario by simulating the page load process. 

The PLT of a webpage will not be a constant due to the 
variations of network latency, server response time, and 
load on the client. WebProphet can predict the statistical 
properties (e.g., median or 95th-percentile) of the PLT 
distribution under a new scenario. For this purpose, we
first collect a reasonably large number of page load traces 
in a baseline scenario using a web agent. Then, for each 
of these traces, we run performance prediction to obtain 
the PLT in the new scenario, and therefore produce the 
PLT distribution in a new scenario. 
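
The per-trace procedure above amounts to mapping a predictor over the baseline traces and summarizing the results. The sketch below is a minimal illustration: predict_plt is a hypothetical callback standing in for the performance predictor, and the nearest-rank percentile is one of several reasonable definitions, chosen here for simplicity.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence (p in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def predict_distribution(baseline_traces, predict_plt):
    """Apply the (hypothetical) predictor to every baseline trace and summarize."""
    predicted = [predict_plt(trace) for trace in baseline_traces]
    return {
        "median": statistics.median(predicted),
        "p95": percentile(predicted, 95),
    }
```

For example, if a hypothetical optimization halved the PLT of every baseline trace, calling predict_distribution(traces, lambda t: t * 0.5) on a list of baseline PLT values would report the median and 95th percentile of the halved values.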

Currently, we do not explicitly consider the effect of 
packet loss in our model. In other words, we assume the 
same loss condition in the baseline and new scenarios. 
Differences in loss conditions can change the number of 
round trips involved in loading an object, which in turn 
leads to prediction errors (§4.1). The impact of packet
loss on PLT can be highly variable, and highly depen- 
dent on factors such as network transients, TCP conges- 
tion states, and specific TCP loss recovery mechanisms. 
In spite of this limitation, as shown in §7, WebProphet
attains high prediction accuracy in both controlled and 
real-world experiments under normal loss conditions. 


3 Dependency Extraction 


In this section, we first present an overview of depen- 
dency relationships between web objects and describe 
the types of dependencies that we aim to discover. We 
then explain the details of our dependency extraction al- 
gorithm based on timing perturbation. 


3.1 What are dependencies? 


Modern webpages may contain many types of objects,
including HTML, Javascript, CSS, and images. These
embedded objects are downloaded via separate requests
on potentially multiple TCP connections instead of all
at once. For instance, the main HTML object may contain
a Javascript object whose execution will lead to additional
downloads of HTML and image objects. We
say one object depends on the other if the former cannot
be downloaded until the latter is available. Dependencies
between objects can arise for a number of reasons.
Common ones include: i) The embedded
objects in an HTML page will depend on the
HTML page; ii) Since many objects are dynamically
requested during Javascript execution, these objects depend
on the corresponding Javascript; iii) The download
of an external CSS or Javascript object may block the
download of other types of objects in the same HTML
page [22]; iv) Object downloads may depend on certain
events in a Javascript object or the web browser. For instance,
a Javascript object may download image B only after image
A is loaded.

Given an object A, its dependent objects usually cannot
be requested before A is completely downloaded.
However, there are exceptions. Today’s browsers render
an HTML page in a streamlined fashion, by which
we mean the HTML page can be partially displayed
even before its download finishes. For example, if an
HTML page has an embedded image, the image can be
downloaded and displayed in parallel with the download
of the HTML page. The image download may
start once the tag <img src=... /> (identified by
a byte offset in the HTML page) has been parsed. We
call an HTML object a stream object. We use the dependency
offset, offset_A(img), to denote the offset of the last byte
of <img src=... /> in the stream object A. We
observed this streamlined processing behavior in major
browsers including IE, Firefox and Chrome.

Given an object X, we use descendant(X) to denote
the set of objects that depend on X and use ancestor(X)
to denote the set of objects that X depends on. By
definition, X cannot be requested until all the objects
in ancestor(X) are available. Among the objects in
ancestor(X), we are particularly interested in the object Y
which is the last to become available. We call Y the last
parent of X. If Y is a stream object, its available-time
is when the dependency offset offset_Y(X) has been loaded.
If Y is a non-stream object, its available-time is when
Y is completely loaded. In §4.1, we will explain how
to use the available-time of Y to estimate the start time
of X’s download. Essentially, this will allow us to predict
the PLT of a webpage. While X only has one last
parent in one particular page load, its last parent may
change across different page loads due to variations in
the available-time of its ancestors. We use parent(X) to
denote the subset of the objects in ancestor(X) which
may be the last parent of X.
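
The definitions of available-time and last parent translate into a few lines of code. The sketch below is illustrative only: AncestorObservation and its fields are hypothetical names for what one page load would record about each ancestor of X.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AncestorObservation:
    """One ancestor of X as seen in a single page load (hypothetical record)."""
    name: str
    is_stream: bool
    complete_time: float                  # when the whole object finished loading
    offset_time: Optional[float] = None   # when offset_Y(X) was loaded (stream objects only)

    def available_time(self) -> float:
        if self.is_stream:
            if self.offset_time is None:
                raise ValueError(f"stream object {self.name} needs an offset time")
            return self.offset_time
        return self.complete_time


def last_parent(ancestors):
    """The ancestor whose available-time is latest in this page load."""
    return max(ancestors, key=lambda a: a.available_time())
```

Running last_parent on two loads of the same page with different timings can return different objects, which is why parent(X) is a set rather than a single object.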

Given a webpage, we use a parental dependency graph 
(PDG) to encapsulate the parental relationship between 
objects in the page. A PDG = (V,E) is a Directed 
Acyclic Graph (DAG) and includes a set of nodes and 
directed links. Each node is a web object. Each link 
Y → X means Y is a parent of X.


3.2 How to extract dependencies? 


WebProphet extracts the dependencies of a webpage by
perturbing the download of individual objects. Our key
observation is that the delay of an individual object will be
propagated to its descendants. While the idea is conceptually
simple, the major challenge is to extract the stream parent of
an object and the corresponding dependency offset. Suppose
an object X has a stream parent Y. To discover
this parental relationship and the dependency offset, the
available-time of offset_Y(X) must be later than that of
all the other parents of X in a particular page load. This
requires the ability to control the download of not only
each non-stream parent of X as a whole but also each
partial download of each stream parent of X. As we will
see in §7.4, correctly extracting stream parents and dependency
offsets is critical for accurate PLT prediction.

Discovering ancestors/descendants: Given a webpage
and its embedded objects, we discover the descendants
of each object iteratively. In each round, we reload the
page and delay the download of an object X for τ seconds.
Here, X is an object which has not been processed
and τ is much greater than the normal loading time of
any object. The descendants of X are the objects whose
download is delayed together with X for at least τ seconds.
We repeat this process until the descendants of all
the objects are discovered. Note that the order in which
we delay each object has no influence on the final result.
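
The iterative discovery step can be sketched as a comparison between a baseline page load and one perturbed page load per object. This is a simplified illustration, not WebProphet's implementation: the dictionaries map object names to download start times in seconds, and load_with_delay is a hypothetical callback that reloads the page while delaying one object by τ (written tau below) and returns the observed start times.

```python
def delayed_descendants(delayed, baseline_start, perturbed_start, tau):
    """Objects other than `delayed` whose download start moved by at least tau seconds."""
    return {
        obj
        for obj, start in perturbed_start.items()
        if obj != delayed
        and obj in baseline_start
        and start - baseline_start[obj] >= tau
    }


def discover_descendants(baseline_start, load_with_delay, tau):
    """One perturbed page load per object; the processing order does not matter."""
    return {
        obj: delayed_descendants(obj, baseline_start, load_with_delay(obj, tau), tau)
        for obj in baseline_start
    }
```

Choosing τ much larger than any normal loading time, as the text requires, is what keeps ordinary timing noise from being mistaken for a propagated delay.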

Our approach for dependency extraction makes two 
assumptions. First, we assume the dependencies of a 
webpage do not change during the discovery process. 
This may not hold in practice. When a page is reloaded, 
there could be some minor changes in the new page. For 
instance, there could be parameter changes in the Uniform
Resource Identifier (URI) of certain objects. We
tackle this problem by matching similar URIs in differ- 
ent rounds according to edit distance. Moreover, there 
could be object changes due to reasons such as new ad- 
vertisements. We find such changes tend to have limited 
impact on the overall structure of the page or the PDG. 
This is because the number of affected objects is small 
and they usually do not have any descendants. In §7, we
will show that our prediction results are highly accurate 
in spite of minor changes in webpages. 
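
Matching similar URIs across rounds by edit distance can be sketched as follows. The text does not specify the matching rule, so the details here are assumptions: the Levenshtein distance is used as the edit distance, and max_ratio (the largest tolerated distance relative to URI length) is a hypothetical threshold. The example URIs in the note below are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def match_uri(uri, candidates, max_ratio=0.2):
    """Closest candidate from another round, or None if nothing is similar enough."""
    best = min(candidates, key=lambda c: edit_distance(uri, c), default=None)
    if best is None:
        return None
    if edit_distance(uri, best) <= max_ratio * max(len(uri), len(best)):
        return best
    return None
```

With these assumed settings, a URI such as http://a.example/img?id=123 would be matched to http://a.example/img?id=456 from another round, while an unrelated URI would be left unmatched.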

The second assumption we make is that the artificially
injected delay will not change the dependencies in the
page. Among the pages we studied, we found only one
exception in the “driving directions” webpage of Google
Maps. There are two Javascripts A.js and B.js which
have the same parent main.js. We use A and B
to represent the names of these two Javascripts, given that
their original names are very long. When main.js is
severely delayed, A.js and B.js sometimes are combined
into one single Javascript named AB.js. This
probably reflects the fact that main.js attempts to
adapt when it detects poor download speed. We identified
this application behavior because it leads to inconsistencies
in the extracted dependencies of the page.
Among the applications we studied, only Google Maps
exhibits this behavior, which is handled with a simple
heuristic. In the future, we plan to devise a more systematic
solution to deal with such behavior.

Extracting non-stream parents: Given a non-stream
object X and its descendant Z, we observe that X is the
parent of Z if and only if there does not exist an object
Y which is the descendant of X and the ancestor of Z.
On the one hand, if such Y exists, the available-time of Y
will always be later than that of X. This is because X is a
non-stream object and Y cannot be downloaded until X
is available, which implies X cannot be the parent of Z.
On the other hand, if Y does not exist, we can imagine a
scenario where X is delayed until all the other ancestors
of Z are available. This is possible because none of the
other ancestors of Z depend on X. This implies X may
indeed be the parent of Z. Based on this observation,
Algorithm ExtractNonStreamParent takes the set
of objects and the set of descendants of each object (inferred
from the previous step) as input and computes the
parent set of each object.

Figure 3: Stream parent example.


ExtractNonStreamParent(Object, Descendant)
  For X in Object
    For Z in Descendant(X)
      IsParent = True
      For Y in Descendant(X)
        If (Z in Descendant(Y))
          IsParent = False
          Break
        EndIf
      EndFor
      If (IsParent) add X to Parent(Z)
    EndFor
  EndFor
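
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-of-sets representation, and the three-object example page are assumptions made here, and each descendant set is assumed to exclude the object itself.

```python
def extract_non_stream_parents(objects, descendants):
    """Return {object: set of parents} given each object's descendant set.

    descendants[x] is assumed to hold the strict descendants of x
    (x itself is not included).
    """
    parents = {obj: set() for obj in objects}
    for x in objects:
        for z in descendants.get(x, set()):
            # X is a parent of Z unless some Y lies strictly between them,
            # i.e. Y descends from X and Z descends from Y.
            has_intermediate = any(
                z in descendants.get(y, set()) for y in descendants[x]
            )
            if not has_intermediate:
                parents[z].add(x)
    return parents


# Hypothetical page: H triggers J, and J in turn triggers I.
example_descendants = {"H": {"J", "I"}, "J": {"I"}, "I": set()}
example_parents = extract_non_stream_parents(["H", "J", "I"], example_descendants)
```

In this example H is not reported as a parent of I, because J sits between them; this is exactly the case that the stream-parent test is meant to revisit.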


Extracting stream parents and dependency offsets:
The method described above may not be useful for discovering
the stream parent of an object. We illustrate
this with an example in Figure 3. A large HTML object
H contains a Javascript J and an image I. J and I
are embedded at the beginning and the end of H respectively
($offset_H(J) < offset_H(I)$). Because the URI of
I is defined in J, I cannot be downloaded until J is executed.
This causes I to depend on both H and J while J
only depends on H. According to the previous method,
H cannot be the parent of I since J is the descendant of
H and the ancestor of I. Nonetheless, when the download
of H is slow, J may have been downloaded and
executed before $offset_H(I)$ becomes available. In this
case, H becomes the last parent of I.

Given a stream object H and its descendant I, we use
the following method to determine whether H is the parent
of I. We first reload the whole page and control the
download of H at an extremely low rate $\lambda$. If H is the
parent of I, all the other ancestors of I should have been
available by the time $offset_H(I)$ is available. We can
then estimate $offset_H(I)$ with $offset_H(I)'$, where the
latter is the offset of H that has been downloaded when
the request of I starts to be sent out. $offset_H(I)'$ can
be directly inferred from network traces and is usually a
bit larger than $offset_H(I)$. This is because it may take
some extra time to request I after $offset_H(I)$ is available.
Since H is downloaded at an extremely low rate,
these two offsets should be very close.

Given $offset_H(I)'$, we perform an additional parental
test to determine whether H is the parent of I. We reload
the whole page again. This time, we control the download
of H at the same low rate $\lambda$ and also delay the
download of all the known non-stream parents of I by $\tau$.
Let $offset_H(I)''$ be the offset of H that has been downloaded
when the request of I is sent out in this run. If
$offset_H(I)'' - offset_H(I)' \ll \tau \times \lambda$, this indicates that the
delay of I's known parents has little effect on when I is
requested. Therefore, H should be the last parent of I.

The choice of $\lambda$ reflects the trade-off between measurement
accuracy and efficiency. A smaller $\lambda$ allows
us to estimate $offset_H(I)$ more accurately but leads to
longer running times. The parameter $\tau$ directly affects
the accuracy of parental tests. If $\tau$ is too small, the
results may be susceptible to experimental noise, increasing
the chance of missing true parents. If $\tau$ is
too large, we may mistakenly infer a parental relationship
because $offset_H(I)'' - offset_H(I)'$ is bounded by
$size_H - offset_H(I)$, where $size_H$ is the page size of H.
In our current system, we use $\lambda = size_H/200$ bytes/sec
and $\tau = 2$ seconds. This means the HTML object H will
take 200 seconds to transfer. We will study the accuracy
of dependency extraction in §6.
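
The parental test can be sketched as a small Python check. This is only an illustration under stated assumptions: the function names are hypothetical, and the `ratio` parameter is an arbitrary stand-in for "much less than", since the text does not give a numeric threshold.

```python
def transfer_rate(size_h_bytes, duration_s=200.0):
    """Throttled rate lambda (bytes/sec) so that H takes duration_s to load."""
    return size_h_bytes / duration_s


def is_stream_parent(offset_first, offset_second, lam, tau=2.0, ratio=0.1):
    """Return True if H looks like the last parent of I.

    offset_first  -- offset of H downloaded when I was requested (first run)
    offset_second -- same offset when I's known non-stream parents were
                     additionally delayed by tau seconds (second run)
    ratio         -- assumed threshold for "much less than" tau * lambda
    """
    return (offset_second - offset_first) < ratio * tau * lam


lam = transfer_rate(100_000)                           # 500 bytes/sec
unaffected = is_stream_parent(60_000, 60_040, lam)     # request moved by 40 bytes
shifted = is_stream_parent(60_000, 60_950, lam)        # request moved by ~tau * lam
```

With a 100,000-byte HTML object, delaying the other parents by 2 seconds could shift the request offset by up to about 1,000 bytes; a shift of 40 bytes suggests that H, not the delayed parents, gates the request.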
Discussion: We currently infer the timing information 
from the packet trace of a page load (§4.1). One alternative
approach is to extract dependencies through some com- 
bination of static and dynamic program analysis. In fact, 
it is quite straightforward to parse an HTML object to 
extract its dependencies. However, extracting the depen- 
dencies of Javascript objects requires extensive browser 
instrumentations. Since the PDG of a page may vary de- 
pending on how the page is rendered by a browser, we 
will have to instrument each type and each version of 
the major browsers. In comparison, our trace-based ap- 
proach can more easily work with different browsers. 


4 Performance Prediction 


In this section, we describe our methodology for predict- 
ing performance under hypothetical scenarios. Given the 
PLT of a webpage in a baseline scenario, we aim to predict
the new PLT when there are changes in the delay factors
(including client delay, server delay, RTT, and DNS
lookup time) of any objects in the page. The basic idea 
is to develop a model that can simulate the page load 
process of a browser under any hypothetical scenarios. 
In practice, the page load process can be very complex, 
since it also relates to browser behavior and parameters, 
web objects dependencies, versions of TCP and HTTP 
protocols, and network conditions. The key challenge is 
to keep the model simple and yet accurate. This requires 
us to provide the right level of abstraction in the model 
which captures the most fundamental characteristics of 
webpages and browsers. 

Figure 4 illustrates the overall flow of performance 
prediction in WebProphet. We first infer the timing infor- 
mation of each object from the packet trace of a page load 
in a baseline scenario. Based on the PDG of the page, we 
further annotate each object with additional timing infor- 
mation related to client delay. We then adjust the object 
timing information to reflect the changes from the base- 
line scenario to the new one. Finally, we simulate the 
page load process with the new object timing information
to estimate the new PLT. We explain the first
three steps in §4.1 and leave the details of the last step to
§4.2.


4.1 Acquiring object timing information 


Inferring basic object timing information: We infer 
web objects and their timing information from the packet 
trace of a page load collected on the client side. This 
makes our approach easily deployable since it does not 
require any instrumentation in browsers or applications. 
We identify three types of activities in the trace: 


• DNS: the time used for looking up a domain name.

• TCP connection: the time used for establishing a
TCP connection.

• HTTP: the time of loading a web object. As
illustrated in Figure 5, an HTTP activity can be
further decomposed into three parts: (i) Request
transfer time: the time to transfer the first byte to
the last byte of an HTTP request; (ii) Response
time: the time from when the last byte of the
HTTP request is sent out to when the first byte of
the HTTP reply is received. This includes one RTT
plus server delay; (iii) Reply transfer time: the
time to transfer the first byte to the last byte of an
HTTP reply.

In addition, we infer the RTT for each TCP connection.
The RTT of a TCP connection should be quite
stable since the entire page load process usually lasts
for only a few seconds. We also infer the number of
round-trips involved in transferring an HTTP request or
reply. Such information allows us to predict HTTP transfer
times when RTT changes. We will provide the details
of packet trace analysis in §5.3.
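
The decomposition of an HTTP activity can be written down directly from four timestamps. The sketch below is illustrative only; the class and field names are assumptions introduced here, and the timestamps are taken to be seconds measured at the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpActivity:
    request_first: float  # first byte of the HTTP request sent
    request_last: float   # last byte of the HTTP request sent
    reply_first: float    # first byte of the HTTP reply received
    reply_last: float     # last byte of the HTTP reply received

    @property
    def request_transfer(self):
        return self.request_last - self.request_first

    @property
    def response_time(self):
        # One RTT plus server delay, per the definition in the text.
        return self.reply_first - self.request_last

    @property
    def reply_transfer(self):
        return self.reply_last - self.reply_first

    def server_delay(self, rtt):
        """Server delay is what remains of the response time after one RTT."""
        return self.response_time - rtt


activity = HttpActivity(0.0, 0.01, 0.21, 0.46)
```

For the sample activity, the response time is 0.2 s; with an RTT of 0.1 s the remaining 0.1 s is attributed to the server.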
Adding client delay information: When the last par- 
ent of an object X becomes available, the browser will 
not issue a request for X immediately. This is because
the browser needs time to do some additional processing,
e.g., parsing an HTML page or executing Javascript.
For object X, we use the client delay to denote the time 
from when its last parent is available to when the browser 
starts to request it. When the browser loads a sophisti- 
cated webpage or the client machine is slow, client delay 
may have significant impact on PLT. We infer the client 
delay of each object by combining basic object timing 
information with the PDG of the page. Note that when 
the browser starts to request an object, the first activity 
can be DNS, TCP connection, or HTTP depending on 
the current state and behavior of the browser. 
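
As a small worked example of the definition above, the client delay of an object can be computed from the time its first activity starts and the times at which its parents became available. The function name is hypothetical, and treating a parentless root object as available at time 0 is an assumption of this sketch.

```python
def client_delay(request_start, parent_available_times):
    """Time from when the last parent is available to the first activity
    (DNS, TCP connection, or HTTP) of the object's request."""
    last_parent = max(parent_available_times, default=0.0)
    return request_start - last_parent


delay = client_delay(1.35, [0.9, 1.2])   # last parent available at 1.2 s
root_delay = client_delay(0.02, [])      # root object, no parents
```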

Many browsers limit the maximum number of TCP 
connections to a host, e.g., six in IE 8 and Firefox 3. This 
can cause the request for an object to wait for available 
connections even when it is ready to be sent. Therefore, 
the client delay we observe in a trace may be longer than 
the actual browser processing time. To overcome this 
problem, when collecting the packet trace in a baseline 
scenario, we set the TCP connection limit of the browser 
to a large number, for instance, 30. This helps to elimi- 
nate the effects of connection waiting time. Nonetheless, 
we will still predict the PLT in a new scenario under the 
default TCP connection limit of the browser (§4.2).
Adjusting object timing information according to
new scenario: So far, we have obtained the object timing
information under the baseline scenario. We need to
adjust the timing information for each object according
to the new scenario. Let $server_\delta$ be the server delay difference
between the new and the baseline scenario. We
simply add $server_\delta$ to the response time of each object
to reflect the server delay change in the new scenario. We
use similar methods to adjust DNS activity and client delay
for each object. RTT change ($rtt_\delta$) needs some special
handling. Suppose the HTTP request and reply
transfers involve m and n round-trips for object X. We
will add $(m + n + 1) \times rtt_\delta$ to the HTTP activity of X
(m round-trips for the request, n for the reply, and one
for the response time) and $rtt_\delta$ to the TCP connection
activity if a new TCP
connection is required for loading X. Our assumption
is that the number of round-trips involved in loading an
object is the same in the baseline and new scenarios. Our
results in §7 confirm the validity of this assumption in
PlanetLab experiments. This assumption could be violated
if bandwidth becomes the bottleneck, e.g., on a DSL
link. Further research is needed to deal with such scenarios.
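
The adjustment rules can be sketched as follows. This is a minimal illustration under stated assumptions: the `ObjectTiming` record and its field names are hypothetical, a TCP or DNS time of zero is used to mean that the activity did not occur for this object, and the HTTP activity is stored as a single duration that already includes the response time.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObjectTiming:
    dns: float          # DNS lookup time (s); 0 if no lookup was needed
    tcp_connect: float  # TCP handshake time (s); 0 if a connection was reused
    http: float         # whole HTTP activity (s)
    request_rtts: int   # m: round-trips used by the request transfer
    reply_rtts: int     # n: round-trips used by the reply transfer


def adjust_timing(t, rtt_delta=0.0, server_delta=0.0, dns_delta=0.0):
    """Shift baseline timings by the delay differences of the new scenario."""
    # m + n round-trips for the transfers, plus one inside the response time.
    http = t.http + (t.request_rtts + t.reply_rtts + 1) * rtt_delta + server_delta
    tcp = t.tcp_connect + rtt_delta if t.tcp_connect > 0 else 0.0
    dns = t.dns + dns_delta if t.dns > 0 else 0.0
    return replace(t, dns=dns, tcp_connect=tcp, http=http)


baseline = ObjectTiming(dns=0.05, tcp_connect=0.1, http=0.5,
                        request_rtts=1, reply_rtts=2)
adjusted = adjust_timing(baseline, rtt_delta=0.1, server_delta=0.05,
                         dns_delta=0.02)
```

Here an extra 100 ms of RTT adds 400 ms to the HTTP activity (1 + 2 + 1 round-trips) and 100 ms to the TCP handshake.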

Discussion on object & DNS cache: Besides the four
delay factors mentioned above, the PLT of a page in a
new scenario will also be affected by the object and DNS
cache. To handle cached objects and DNS names in a
new scenario, we collect page load traces with the same
set of cached objects and DNS names in a baseline scenario.
We will explain how to control object and DNS
cache in §5.1. Suppose $W$ is the PDG of a page when
no object is cached. When an object $x$ is cached, $W$ will
transform into a new PDG $W'$ where $x$ is removed and
each of its children $x_c$ is directly connected with each of
its parents $x_p$. Accordingly, the client delay of $(x_c, x_p)$
in $W'$ will include the cache lookup time of $x$ and the
client delay of $(x, x_p)$ and $(x_c, x)$ in $W$. Hence, there is
no need to explicitly consider the timing information of
$x$ in $W'$.
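
The graph transformation for a cached object can be sketched as below. The representation (a dictionary mapping each object to its set of parents) and the function name are assumptions of this example; it only rewires edges and does not model the client delays.

```python
def remove_cached_object(parents, cached):
    """Return a new parent map in which `cached` is spliced out of the PDG.

    Every child of the cached object is connected directly to each of the
    cached object's parents.
    """
    cached_parents = parents.get(cached, set())
    new_parents = {}
    for obj, obj_parents in parents.items():
        if obj == cached:
            continue
        if cached in obj_parents:
            new_parents[obj] = (obj_parents - {cached}) | cached_parents
        else:
            new_parents[obj] = set(obj_parents)
    return new_parents


pdg = {"H": set(), "J": {"H"}, "I": {"J"}}
pdg_with_j_cached = remove_cached_object(pdg, "J")
```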

Our current approach can only predict the PLT under
the same caching state. Given a page with n objects, we
will need to measure $2^n$ baseline scenarios to handle all
the possible caching states. To reduce the measurement
overheads, we could explicitly model the timing information
of a cached object x in three cases: i) TTL has
not expired: x is directly looked up from cache; ii) TTL
has expired but x has not changed: x is revalidated and
then looked up from cache; iii) TTL has expired and x
has changed: x is revalidated and downloaded from the
server. To predict the PLT under any caching state, we
simply need to extend our model to include the cache
lookup time and the number of round trips involved in
the revalidation of each object. We can use a small constant
to represent the former and perform controlled experiments
to measure the latter. The details are out of the
scope of this paper.


4.2 Simulating page load process 


We now describe our methodology for predicting PLT 
based on object timing information. The key challenge 
here is that object downloads are not independent of each
other. The download of an object may be blocked be- 
cause its dependent objects are unavailable or because 
there are no TCP connections ready for use. To tackle 
these problems, we simulate the page load process by 
taking into account the constraints of browser and PDG. 
Browser behavior: We studied a few popular browsers
including IE, Firefox and Chrome. They share a few
important features. Presently, they all use HTTP/1.1 either
with HTTP pipelining disabled by default or without
pipelining support at all. This is because HTTP pipelining
has not been widely supported by proxies and may
suffer head-of-line blocking in the presence of dynamic
content [1], e.g., one slow response in the pipeline will
block the subsequent responses. Given this, we
do not consider the effect of pipelining in this paper.
More sophisticated techniques might be needed to model
pipelining if it becomes widely used in the future. Without
pipelining, HTTP request-reply pairs do not overlap
with each other within the same TCP connection.

Case                           | I | II | III | IV | V
Available TCP connections      | - | -  | Y   | N  | N
Max # of parallel connections  | - | -  | -   | N  | Y

Table 1: Five possible cases for loading an object.

Figure 6: Four cases of activities for loading an object.

In HTTP/1.1, a browser uses persistent TCP connec- 
tions which can be reused for multiple HTTP requests 
and replies. The browser attempts to keep the number 
of parallel connections small. It opens a new connection 
only when it needs to send a request but all the existing 
connections are occupied by other requests or replies. A 
browser is configured with an internal parameter to limit 
the maximum number of parallel connections with a par- 
ticular host. Note that the limit is applied to a host instead 
of to an IP address. If multiple hosts map to the same IP 
address, the number of parallel connections with that IP 
address can exceed the limit. 

Loading an object may trigger multiple activities including
looking up a DNS name, establishing a new
TCP connection, waiting for an existing TCP connection,
and/or issuing an HTTP request. Table 1 lists the
five possible cases and the conditions of each of these
cases. A “-” in the table means the corresponding condition
does not matter. The activities involved in each
case are illustrated in Figure 6. For instance, in Case V,
a browser needs to load an object from a domain with
which it already has TCP connections. Because all the
existing TCP connections are occupied and the number
of parallel connections has reached the maximum limit,
the browser has to wait for the availability of an existing
connection to issue the new HTTP request (Figure 6(d)).


PredictPLT(ObjectTimingInfo, PDG)
  Insert root objects into CandidateQ
  While (CandidateQ not empty)
    1) Get earliest candidate C from CandidateQ
    2) Load C according to conditions in Table 1
    3) Find new candidates whose parents are available
    4) Adjust timings of new candidates
    5) Insert new candidates into CandidateQ
  EndWhile


Simulating page load: Given a webpage, Algorithm
PredictPLT takes the timing information of each object
and the PDG as input and simulates the page load
process. The PLT is estimated as the time when all the
objects are loaded. For each object X, the algorithm
keeps track of four time variables: i) $T_p$: when X's last
parent is available; ii) $T_r$: when the HTTP request for X
is ready to be sent; iii) $T_f$: when the first byte of the
HTTP request is sent; and iv) $T_l$: when the last byte
of the HTTP reply is received. Figure 6 illustrates the
position of these time variables in four different scenarios.
In addition, the algorithm maintains a priority queue
CandidateQ which contains the objects that can be requested.
The objects in CandidateQ are sorted based
on $T_r$.
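
A simplified version of this simulation can be sketched in Python. It is not the authors' implementation and makes several assumptions that are labeled in the code: DNS lookup and TCP handshake are folded into a single `setup` time charged whenever a new connection is opened, dependency offsets of stream parents are ignored (a child becomes ready only after its parents are fully loaded), and all names are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class WebObject:
    host: str
    http: float                # duration of the whole HTTP activity (s)
    client_delay: float = 0.0  # delay between T_p and T_r (s)
    setup: float = 0.0         # assumed DNS + TCP cost of a new connection (s)
    parents: tuple = ()


def predict_plt(objects, max_conns_per_host=6):
    """Simulate a page load and return the time the last object finishes."""
    done = {}    # object name -> T_l
    conns = {}   # host -> list of times at which each connection frees up
    # Root objects are ready after their client delay; queue is ordered by T_r.
    queue = [(o.client_delay, name) for name, o in objects.items() if not o.parents]
    heapq.heapify(queue)
    while queue:
        t_r, name = heapq.heappop(queue)
        obj = objects[name]
        pool = conns.setdefault(obj.host, [])
        idle = [i for i, free_at in enumerate(pool) if free_at <= t_r]
        if idle:                                 # reuse an idle connection
            slot, t_f = idle[0], t_r
        elif len(pool) < max_conns_per_host:     # open a new connection
            pool.append(0.0)
            slot, t_f = len(pool) - 1, t_r + obj.setup
        else:                                    # wait for the earliest one
            slot = min(range(len(pool)), key=pool.__getitem__)
            t_f = pool[slot]
        done[name] = pool[slot] = t_f + obj.http
        # A child becomes a candidate once its last parent has been loaded.
        for child, c in objects.items():
            if name in c.parents and all(p in done for p in c.parents):
                t_p = max(done[p] for p in c.parents)
                heapq.heappush(queue, (t_p + c.client_delay, child))
    return max(done.values())


page = {
    "html": WebObject(host="a.example", http=0.3, setup=0.1),
    "js": WebObject(host="a.example", http=0.2, client_delay=0.05,
                    setup=0.1, parents=("html",)),
    "img": WebObject(host="a.example", http=0.4, client_delay=0.05,
                     setup=0.1, parents=("html",)),
}
plt_one_conn = predict_plt(page, max_conns_per_host=1)
plt_two_conns = predict_plt(page, max_conns_per_host=2)
```

With a single connection the script must wait for the image to finish, so the page completes later than with two connections, where the second request only pays the setup cost of a new connection.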


5 Implementation 


As illustrated in Figure 2, the implementation of 
WebProphet comprises three major components. In this 
section, we will describe each of them in more detail. 
The whole system is implemented with roughly 11,000 
lines of code in Python, Perl, Javascript and Bro’s policy 
language [18]. 


5.1 Measurement engine 


The measurement engine includes a set of web agents 
which are currently deployed at multiple PlanetLab sites. 
These web agents allow us to measure the PLT of a webpage
under diverse client, server, and network conditions.
A centralized controller is used to maintain the
continual operation of the agents and perform upgrades
when a new version of agent software becomes available.
The controller can upload script snippets to an agent to 
control the interaction between the agent and a webpage. 
It also retrieves and stores the packet traces from the 
agents logged by tcpdump. The controller is written 
in Perl and Python with 1,300 lines of code. 


The web agent needs to meet a few requirements. 
First, it should be able to interact with a webpage automatically.
As mentioned in §2, WebProphet requires a
potentially large number of page load traces in a baseline 
scenario to predict the statistical properties (e.g., median 
or 95th-percentile) of the PLT distribution in a new sce- 
nario. Second, it should behave like a full-featured web 
browser in order to simulate user interaction with sophis- 
ticated web 2.0 applications, e.g., Google Maps. This 
is especially important for correctly measuring the user- 
perceived PLT of these complex applications. Third, 
it should provide support for setting object and DNS 
cache, which will affect the PLT (§4.1). We need to control
the web agent to cache the same set of objects and
DNS names in the baseline and new scenarios. Fourth, 
it should be able to adjust the parallel TCP connection 
limit, e.g., to a large number, to eliminate the impact of 
connection waiting time (§4.1).


Existing web measurement tools cannot meet our re- 
quirements. Simple web clients (e.g., wget, curl, and 
lynx) do not execute the Javascripts in the pages. Web 
form automation tools [3, 5, 7] and KITE [4] (an auto- 
mated web measurement tool developed by Keynote) do 
not provide control for object/DNS cache or TCP con- 
nection limit. This prompts us to develop our own web 
agent, which uses the Jssh plug-in to take full control of the
Firefox 3 browser. Through the XPCOM [8] interfaces
of Firefox, we use Javascript to call the internal APIs of
Firefox. These internal APIs support object and DNS
cache control, TCP connection limit adjustment, and user 
input simulation. The user inputs from mouse and key- 
board can be simulated as DOM [6] events. 


We developed a set of script snippets to automate the 
interactions with multiple complex web services, such 
as Google/Yahoo Search, Google/Yahoo Maps, Gmail, 
Hotmail, Flickr, etc. The script snippet for each web ser- 
vice comprises only about 10 to 150 lines of code, de- 
pending on the complexity of the service. We believe it 
is quite easy to create new script snippets for other services
too. The web agent, excluding the service-specific
script snippets, is implemented in Javascript and Python 
with 1,100 lines of code. The whole automation part of 
the web agent has no measurable effects on PLT since it 
incurs very little overhead. 


5.2 Dependency extractor 


To extract the PDG of a web page, we set up a web agent
to go through a web proxy running on the same host. 
The web proxy is modified from a simple proxy writ- 
ten in Python. We extended the proxy with the support 
of delaying the download of any specified object, which 
is required for discovering the descendants of the object 
(§3.2). We also added the functionality of controlling the
download speed of a stream object, which is required for
discovering stream parent and dependency offset (§3.2).

Given a webpage, we first obtain the list of its embed- 
ded objects by loading it once. The proxy will cache all 
the objects observed in the first round for future use. This 
reduces the measurement overhead imposed on the origin 
servers. We then subsequently control the download of 
one object during each page reload and record the timing 
information of object download. Finally, we extract the 
PDG according to the procedure described in §3.2. The
dependency extractor and web proxy include 2,800 lines 
of code in Python. 


5.3 Performance predictor


The performance predictor comprises a trace analyzer 
and a page simulator. The trace analyzer extracts basic
object timing information (described in §4.1) from
packet traces in pcap format. It leverages Bro [18], a 
network intrusion detection system, to parse DNS, TCP, 
and HTTP protocol information. We write programs us- 
ing Bro’s policy language to recover timing information, 
e.g., DNS lookup time, TCP handshake time, and HTTP 
transfer and response times. 

We estimate the RTT of a TCP connection using the 
time between the SYN and SYN/ACK packets. This is 
because many web services have relatively short TCP 
connections (e.g., a few seconds) and the RTT is usually 
quite stable in such time scale. We could use other exist- 
ing techniques [12, 24] to estimate RTT for web services 
that involve long TCP connections. We infer the num- 
ber of round trips involved in an HTTP transfer based 
on the TCP self-clocking behavior [24] — the packets 
in the same TCP send window are very close to each 
other while the packets in two adjacent send windows 
are at least one RTT apart. We compute the server delay 
by subtracting one RTT from the time interval between 
two adjacent send windows. Using this method, we find 
Google and Yahoo Search process user queries in a streamlined
fashion. The server will return partial results to
users while additional results are being generated. This
causes some extra delay to the reply packets in multiple
different round trips. Our method can handle such cases
well, leading to the high prediction accuracy reported in §7.
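
The self-clocking heuristic can be illustrated with a short sketch that groups packets of one transfer into send windows by their inter-arrival gaps. The function name and the `gap_fraction` threshold are assumptions of this example; the text states only that packets in adjacent windows are at least one RTT apart, and the looser threshold here is chosen to tolerate timing noise.

```python
def count_send_windows(packet_times, rtt, gap_fraction=0.5):
    """Count TCP send windows in a list of packet timestamps (seconds).

    A new window is assumed to start whenever the gap to the previous
    packet is at least gap_fraction * rtt.
    """
    if not packet_times:
        return 0
    times = sorted(packet_times)
    windows = 1
    for prev, cur in zip(times, times[1:]):
        if cur - prev >= gap_fraction * rtt:
            windows += 1
    return windows


# RTT estimated from the handshake: SYN at 0.000 s, SYN/ACK at 0.100 s.
rtt_estimate = 0.100 - 0.000
reply_packets = [0.300, 0.301, 0.302, 0.401, 0.402, 0.503]
windows = count_send_windows(reply_packets, rtt_estimate)
```

The six reply packets fall into three bursts roughly one RTT apart, so the transfer is counted as three send windows.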

The page simulator combines the basic object timing
information and the PDG to infer the client delay of each
object. It then adjusts the timing of each object according
to the specifications in the new scenario. Finally, it
simulates the page load process (§4.2) and outputs the
predicted PLT. The trace analyzer and page simulator include
6,200 lines of code written in Python and Bro’s
policy language. 


6 Results of Dependency Extraction 


We now characterize and validate the PDGs of
Google/Yahoo Search and Maps extracted by
WebProphet. Google/Yahoo Search are two of the
most popular web services and their PDGs are relatively 
easy to validate. In contrast, the PDGs of Google/Yahoo 
Maps are much more complex. In fact, Yahoo Maps has 
the most complicated PDG in terms of the number of 
objects and dependencies among all the web services 
we studied (including Amazon.com, Flickr, and Google 
Docs). 

In this paper, we only present the results on the cases 
where there is no cached object. This is common for ac- 
cessing webpages in which most contents are dynamic. 
For instance, the Yahoo performance team found that 40 to 60%
of Yahoo users have an empty cache when visiting Ya- 
hoo [22]. Our approach also works when some objects 
are cached and we omit those results here due to lack of 
space. 





Figure 7: The PDG of Google Search.

Figure 8: The PDG of Yahoo Search.

Google/Yahoo Search PDG: Figures 7 and 8 illustrate the
inferred PDGs of Google/Yahoo Search for the search keyword
“mapouka”. The Google Search page has one
HTML and several images while the Yahoo Search page
has one HTML, one CSS, and a few Javascripts and images.
The Google Search page is simpler than that of
Yahoo. The former has not only fewer dependencies but
also fewer levels of dependencies. This could be one
of the factors that cause the PLT of Google Search to
be smaller (§7). The PDGs for different keywords have
similar structure. Some keywords may lead to an extra
Javascript object or a different number of images in the
PDG of Google Search. Because the Search pages are
not very complicated, we are able to verify that the PDGs
produced by WebProphet are the same as those extracted
through manual code analysis.

Figure 9: The simplified PDG of Google Maps.

Google/Yahoo Maps PDG: We study the PDG of the 
driving direction pages of Google/Yahoo Maps. We use a 
pair of addresses of the Whole Foods stores in Arkansas, 
USA. The full PDGs of Google/Yahoo Maps are too 
complex to read, e.g., Yahoo Maps has a total number of 
110 objects and 172 dependencies. Instead of showing 
the full PDGs, we simplify them by merging the objects 
that share the same sets of parents and children into a 
single node. The two simplified PDGs are shown in Figures
9 and 10. Each node carries a label which describes
the number of objects of each type. For instance, the label
“#HTML=1,#JS=1,#IMG=17” means this node corresponds
to 1 HTML, 1 Javascript, and 17 image objects
in the full PDG. 
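
The simplification step (merging objects that share the same sets of parents and children) can be sketched as follows. The function name, the parent-map representation, and the tiny example page are assumptions of this illustration; the label format mirrors the one quoted above.

```python
from collections import defaultdict


def simplify_pdg(parents, kinds):
    """Merge objects with identical parent and child sets; return node labels.

    parents -- {object: set of parent objects}
    kinds   -- {object: type string such as "HTML", "JS" or "IMG"}
    """
    children = defaultdict(set)
    for obj, obj_parents in parents.items():
        for p in obj_parents:
            children[p].add(obj)
    groups = defaultdict(list)
    for obj in parents:
        key = (frozenset(parents[obj]), frozenset(children[obj]))
        groups[key].append(obj)
    labels = []
    for members in groups.values():
        counts = defaultdict(int)
        for obj in members:
            counts[kinds[obj]] += 1
        labels.append(",".join(f"#{k}={v}" for k, v in sorted(counts.items())))
    return sorted(labels)


example_parents = {
    "index.html": set(),
    "a.js": {"index.html"},
    "1.png": {"index.html"},
    "2.png": {"index.html"},
    "tile.png": {"a.js"},
}
example_kinds = {"index.html": "HTML", "a.js": "JS", "1.png": "IMG",
                 "2.png": "IMG", "tile.png": "IMG"}
node_labels = simplify_pdg(example_parents, example_kinds)
```

The two images embedded directly in the HTML share the same parent and have no children, so they collapse into a single node labeled "#IMG=2".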

Apparently, the PDGs of Maps are significantly more 
sophisticated than those of Search. They include more 
Javascripts for user interactions and more images for map 
tiles. The PDG of Yahoo Maps is even more complex 
than that of Google Maps, as the former comprises a 
larger number and more levels of dependencies. The 
PDGs for different address inputs are quite similar. The 
main differences are in the map tile images. 

The Javascripts of Google/Yahoo Maps are not only 
large (536KB and 670KB respectively) but also obfus- 
cated. It is difficult for us to validate the extracted PDGs 
via manual code analysis. Instead, we verify their cor- 
rectness in an indirect manner. First, we use our de- 
pendency extractor to obtain the “approximate” PDGs of 
Google/Yahoo Maps. Then we construct our own webpages
which exactly match the “approximate” PDGs in
the number of objects, the types of objects, and the dependencies
between objects. After that, we attempt to
infer the PDGs of the constructed pages as if we know
nothing about the “approximate” PDGs. We find the
inferred PDGs exactly match the “approximate” PDGs.
Although this does not prove that we have extracted
the real PDGs of Google/Yahoo Maps, it at least suggests
that we can correctly handle webpages as complex
as Google/Yahoo Maps. Moreover, in §7, we will
show that WebProphet can accurately predict the PLT of
Google/Yahoo Maps under various hypothetical scenarios
using the “approximate” PDGs.

Figure 10: The simplified PDG of Yahoo Maps.


7 Prediction Accuracy 


In this section, we evaluate the PLT prediction accuracy 
of WebProphet for Google/Yahoo Search and Maps. We 
first conduct controlled experiments by manipulating the 
DNS delay, RTT, and server delay for all the objects or 
a subset of the objects in a webpage. We then conduct 
real-world experiments in which every delay factor of 
every object changes simultaneously. We find that ignor- 
ing object dependencies may lead to significant errors in 
the PLT prediction for complex webpages. Finally, we 
show that identifying stream parent and dependency off- 
set is particularly important for accurate PLT prediction 
for simple webpages. 

Suppose $t_b$ and $t_n$ are the PLT in the baseline and new
scenarios and $t_p$ is the PLT predicted by WebProphet.
We could have used $err_p = \frac{abs(t_p - t_n)}{t_n}$, the relative error
between $t_p$ and $t_n$, to evaluate the prediction accuracy.
However, we find $err_p$ may not be the right metric
because it tends to be small when $abs(t_b - t_n) \ll t_n$.
Therefore, we choose $err_c = abs\left(1 - \frac{t_b - t_p}{t_b - t_n}\right)$ as our evaluation
metric. It represents the relative error of predicted
PLT change compared to the actual PLT change between
the baseline and new scenarios. For instance, suppose
$t_b = 5$, $t_n = 4$, and $t_p = 4.2$ seconds. The prediction
error will be 5% measured in $err_p$ vs. 20% measured in
$err_c$. As mentioned in §2, the PLT of a webpage may
not be a constant under a given scenario. In this paper,
we focus on $err_c^{50}$ and $err_c^{95}$ which are computed based
on the median and 95th-percentile in the baseline, new,
and predicted PLT distributions. These two metrics help
to quantify whether WebProphet can make accurate predictions
both for the normal case and for the extreme case.
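
The two error metrics can be checked numerically with a short sketch. The function names are hypothetical, and the sample values (baseline 5 s, actual 4 s, predicted 4.2 s) are chosen so that the two metrics come out at 5% and 20%; the median-based helper is one assumed way of applying the metric to PLT distributions.

```python
import statistics


def err_p(t_n, t_p):
    """Relative error between the predicted and the actual new PLT."""
    return abs(t_p - t_n) / t_n


def err_c(t_b, t_n, t_p):
    """Relative error of the predicted PLT change versus the actual change."""
    return abs(1 - (t_b - t_p) / (t_b - t_n))


def err_c_median(baseline, actual, predicted):
    """err_c computed on the medians of three PLT samples."""
    return err_c(statistics.median(baseline),
                 statistics.median(actual),
                 statistics.median(predicted))


plain_error = err_p(t_n=4.0, t_p=4.2)              # about 0.05
change_error = err_c(t_b=5.0, t_n=4.0, t_p=4.2)    # about 0.20
```

The same 0.2 s miss looks small relative to the 4 s page load but is a fifth of the 1 s improvement being predicted, which is why the change-based metric is the stricter one.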

In the following experiments, we consider the scenarios
where a web service provider is interested in predicting
the PLT reductions as a result of certain optimizations
to the service. To evaluate the prediction accuracy
in each of the experiments, we collect two sets of page
load traces in the baseline scenario. The first set, collected
with the normal TCP connection limit, is for producing
the baseline PLT distribution. The second set, collected
with a large TCP connection limit, is for inferring
the object timing information in the baseline scenario
(§4.1). Thereafter, this object timing information
is used to generate the predicted PLT distribution in the
new scenario. We also collect one set of traces in the new
scenario, from which we can extract the actual new PLT
distribution for validation purposes. Each of the three sets
contains 500 page load traces, which provides enough
samples for computing $err_c^{50}$ and $err_c^{95}$. We only present
the results based on one random keyword for the Search 
services and one random pair of addresses for the Maps 
services. The results of using other keywords or pairs of 
addresses are similar. 

Before presenting the results, we discuss two problems 
that may cause the predicted PLT to deviate from the ac- 
tual PLT. First, there could be slight differences between 
the times when the three sets of traces are collected. These
time differences may lead to differences in the client,
server, and network conditions under which the traces
are collected. The resulting “prediction error” is actu- 
ally due to the limitations of our validation methodology 
rather than due to the limitations of our approach. Sec- 
ond, as mentioned in §2, we currently do not explicitly 
consider packet loss in our model. Any loss behavior 
differences between the baseline and new scenarios may 
also lead to prediction error. As shown in the following 
results, WebProphet can attain high prediction accuracy 
in spite of these two problems. 


7.1 Controlled experiments 


In the controlled experiments, we evaluate the accuracy
of our performance prediction under various RTTs, DNS
lookup times, and server response times. Figure 11 depicts
the setup of our experiments located at Northwestern
University. We run a web agent to collect the packet
traces of page loads. The controlled gateway is used to
inject extra delays during page loads. The normal gateway
does not manipulate any traffic. We configure the
routing table on the web agent to forward the traffic to
the controlled gateway or to the normal gateway to create
the baseline and new scenarios respectively. Since
we currently do not have a precise way to inject client
delays, we simply keep the web agent lightly loaded all
the time. This ensures client delays are roughly the same
in all the controlled experiments. In the next section, we
will show that WebProphet can achieve high prediction
accuracy even when client delays change.

Figure 11: Setup of controlled experiments.

Figure 12: The prediction errors under different injected delays of RTT, DNS lookup time, and server response time.


We use netem to inject extra RTT and DNS lookup 
times on the controlled gateway. netem is a network 
emulator in Linux which can add queuing delay to each 
traversing packet. To inflate RTT, we simply forward 
web traffic to the controlled gateway while forwarding 
the DNS traffic to the normal gateway. We may in- 
flate DNS lookup time in a similar way. Unfortunately, 
netem cannot be used to inject extra server response 
time because it can only add queuing delay to every 
packet. Instead, we develop our own tool based on 
libpcap and libdnet to inflate server response time. 
Our tool can identify the packets that correspond to the 
HTTP requests from the agent to the web server and delay 
them for a certain amount of time. In effect, this extra 
delay will be considered as part of the server response 
time (§4.1). Note that this may trigger TCP timeout on 
Gsearch | 0.74 | 0.21 
Ysearch | 1.04 | 0.26 

Table 2: Changing RTT and DNS lookup time together. 


the web agent. Our tool will intercept and drop all the 
related retransmitted packets. 


Manipulating one delay factor at a time: In the first 
group of experiments, we evaluate the prediction accu- 
racy when we change one delay factor for all the objects 
across a certain range. We inject five different delay val- 
ues (100, 150, 200, 250, and 300 ms) to create the base- 
line scenarios. These values reflect the real delay differ- 
ences observed from different PlanetLab sites (e.g., those 
in Asia vs. those in the US) to our server at Northwestern 
University. We use the scenario without any injected de- 
lay as the new scenario that we aim to predict. Figure 12 
(a)-(c) illustrate the err^{50} for the four web services as we 
change RTT, DNS delay, and server delay. Among the total 
of 60 experiments, 50% have err^{50} < 6.1% and 
90% have err^{50} < 16.2%. The maximum error is 21.7%. 


Manipulating RTT and DNS delay together: Next, we 
evaluate the accuracy of performance prediction when 
multiple delay factors change simultaneously. We in- 
flate both RTT and DNS delay by 100 ms for all the 
objects to create the baseline scenario. We still use the 
scenario without injected delay as the new scenario. Ta- 
ble 2 shows the prediction error for the four applications. 
t_b^{50}, t_n^{50}, and t_p^{50} are the median PLT of the baseline, new, 
and predicted scenarios. WebProphet can accurately predict 
not only the median PLT but also the 95th-percentile 
PLT. The maximum err^{50} and err^{95} are only 5.5% and 
15.9% respectively. 

Manipulating only a subset of objects: So far, we have 
changed the delay factors for all the objects simultane- 
ously. In fact, WebProphet can make accurate predic- 
Table 3: Inflating the RTT to different DCs. 


tion when we change the delay factors for any subset of 
objects. When visiting Yahoo Maps from Northwestern 
University, the web agent will download objects from 
Akamai CDN and two Yahoo data centers (YDC_1 and 
YDC_2). In the following experiments, we create three 
baseline scenarios by injecting 100 ms extra RTT to the 
objects from Akamai CDN, YDC_1, or YDC_2 respectively. 
We still use the scenario without injected delay 
as the new scenario. Our setup is to simulate the case in 
which the owners of Yahoo Maps want to predict the new 
PLT if they could reduce the RTT from users to one of 
their DCs or Akamai CDN. The results in Table 3 show 
that the prediction errors are reasonably small in all the 
three experiments. 


7.2 Real-world experiments 


In the controlled experiments, we changed the delay fac- 
tors by the same amount for a set of objects. In this sec- 
tion, we conduct experiments on PlanetLab to demon- 
strate the effectiveness of WebProphet even when each 
delay factor of each object changes by a different amount 
simultaneously. The PlanetLab nodes in the US normally 
experience smaller PLT when accessing Google/Yahoo 
Search and Maps than those in Asia and Europe. For 
each of the four web services, we pick one international 
node as the baseline scenario. We use a node at North- 
western University as the new scenario. This is to sim- 
ulate the case where the service owners want to predict 
the new PLT if they could optimize their services for international 
users in a certain way. To predict the new PLT, 
we replace the timing information of each object in the 
baseline scenario with the timing information of the same 
object in the new scenario and then run the page load 
simulation. 

Table 4 shows the locations of the baseline scenario 
and the prediction errors. Both err^{50} and err^{95} are 
within 10.7% for all the four services. We find the pre- 
diction errors in the PlanetLab experiments are gener- 
ally smaller than those in the controlled experiments. 
Since we directly use the object timing information in the 
new scenario for PLT prediction in the PlanetLab exper- 
iments, the prediction results are no longer affected by 
the trace collection time differences between the baseline 
and new scenarios. This suggests our model does capture 
a sufficient level of detail for accurate PLT prediction. 


Service | Baseline  | No-stream err^{50} 
Gsearch | Singapore | 21.2% 
Ysearch | Japan     | 258.7% 
Gmap    | Sweden    | 1.2% 
Ymap    | Poland    | 0.1% 

Table 4: The results in the real-world experiments. 


7.3 Importance of modeling object dependencies 


One alternate approach for performance prediction is to 
measure the PLT under a range of values for each de- 
lay factor of each object and then make predictions by 
extrapolating from these measured PLTs through some 
form of regression. This may not be feasible for com- 
plex webpages with many embedded objects. For in- 
stance, even if we measure the PLT only under two differ- 
ent values for each delay factor of each object in Yahoo 
Maps, we will end up measuring the PLT in 2^{440} scenarios, 
when considering all the possible combinations of 
four factors each for 110 objects. Without detailed do- 
main knowledge, it is difficult to decide how many dis- 
tinct scenarios indeed need to be measured for accurate 
prediction. 

One way to reduce the number of measured scenarios 
required for prediction is by assuming independence 
among every delay factor of every object. Let x_i (i = 
1...n) be the delay factors of all the objects that impact 
the PLT of a webpage. Under this assumption: 


$$f(dx_1, dx_2, \ldots, dx_n) = \sum_{i=1}^{n} f_i(dx_i)$$ 


Here, dx_i denotes the change of delay factor x_i. f_i(dx_i) 
is a function that describes the PLT change when only x_i 
changes. f(dx_1, dx_2, ..., dx_n) is a function that describes 
the PLT change when all the x_i's change simultaneously. 
In essence, this equation says the PLT change caused by 
each x_i is independent of the PLT changes caused by 
other delay factors. If this assumption holds, the number 
of measured scenarios required for prediction becomes 
linear in the number of delay factors, significantly 
reducing the measurement overhead. Recently, Chen et 
al. developed a latency prediction tool based on a similar 
assumption [10]. 

In the following, we study to what extent such an independence 
assumption affects the prediction accuracy. We 
use the same baseline scenario as that of Table 2 in §7.1, 
in which both RTT and DNS delay are inflated by 100 
ms for all the objects. We still use the scenario without 
injected delay as the new scenario. For each web ser- 
vice, we divide the objects in the page into three groups 
(G_1, G_2, and G_3) and subsequently measure the PLT 
change δ_i when we only change the delay factors for one 
group G_i at a time. We then predict the PLT change between 
the baseline and new scenarios by taking the sum 
of the δ_i's. As shown in column “Indep err^{50}” of Table 2, 
the prediction errors are significantly higher than those 
of WebProphet for Google/Yahoo Maps. In particular, 
for Yahoo Maps, err^{50} is as high as 85.5%. The prediction 
errors are smaller for Google Search because its 
webpage has very simple dependencies. Since each δ_i 
is directly measured instead of being predicted by any 
model, the prediction error should be close to zero when 
the delay factors are indeed independent. The results of 
the experiment highlight the importance of capturing ob- 
ject dependencies for accurate PLT prediction. 
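
The effect of the independence assumption can be seen on a toy page. The sketch below is illustrative only: the page structure, the delay values, and the names toy_plt and independent_prediction are assumptions made for this example and do not come from the measured services. It predicts the new PLT by summing per-group PLT changes and compares the result with the PLT obtained when all groups change together.

```python
def toy_plt(delay):
    """Toy page: object 'a' loads first, then 'b' and 'c' load in parallel."""
    return delay["a"] + max(delay["b"], delay["c"])


def independent_prediction(baseline, new, groups, measure):
    """Predict the new PLT by summing per-group PLT changes."""
    base_plt = measure(baseline)
    total_change = 0
    for group in groups:
        scenario = dict(baseline)
        for obj in group:
            scenario[obj] = new[obj]  # change one group only
        total_change += measure(scenario) - base_plt  # per-group change
    return base_plt + total_change


# Hypothetical delays in milliseconds (made up for illustration).
baseline = {"a": 300, "b": 400, "c": 400}
new = {"a": 200, "b": 300, "c": 300}
groups = [["a"], ["b"], ["c"]]

predicted = independent_prediction(baseline, new, groups, toy_plt)
actual = toy_plt(new)
error = abs(predicted - actual) / actual
print(predicted, actual, f"{error:.0%}")  # prints: 600 500 20%
```

Because 'b' and 'c' load in parallel, speeding up only one of them leaves the toy PLT unchanged, so the summed per-group changes miss the benefit of speeding up both. A prediction that ignores such dependencies therefore overestimates the new PLT in this example.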


7.4 Importance of identifying stream parent 


One of the key steps in our PDG extraction is to identify 
stream parents and dependency offsets (§3.2). We 
now evaluate the importance of identifying stream parents 
for prediction accuracy. We use the same baseline 
and new scenarios as those in the PlanetLab experiments 
in §7.2. The only difference is that we ignore all the dependencies 
on stream parents in the PDGs when we make 
predictions. As shown in column “No-Stream err^{50}” 
in Table 4, the prediction errors without stream parents 
are much higher than those with stream parents for the 
Search services. Nonetheless, the prediction errors are 
roughly the same for the Maps services. This is because 
the HTML objects account for a significant portion of the 
Search pages. In contrast, most of the objects in the Maps 
pages are non-stream ones, e.g., Javascripts and images. 


8 Usage Scenarios 


As illustrated in the previous sections, the PLT of a com- 
plex webpage may depend on the delay factors of many 
objects. The owner of the page often faces the challenge 
of finding a cost-effective way to improve service per- 
formance from a huge number of possible optimization 
strategies. Since WebProphet can make fast and accurate 
prediction under the changes of any delay factor and/or 
object, it provides the service owner an easy way to nar- 
row down the strategies that could bring the most benefit. 

Because Yahoo Maps has the most complex webpage 
and the largest median PLT (measured from Northwest- 
ern University) among all the four services, we use it as 
an example to demonstrate the power of WebProphet. 
Though we cannot directly validate the effect of these 
changes, the experiments described in §7 provide a basis 
for trust in the predictions. Suppose the owners of 
Yahoo Maps are considering three methods to optimize 
the median PLT: i) OPT_cdn: reducing the RTT of certain 
static objects by moving them from Yahoo data centers to 
the Akamai CDN; ii) OPT_server: reducing the server response 
time by half for certain dynamic objects; and iii) 
OPT_client: reducing the client execution time by half 
for certain objects. Since the Yahoo Maps page contains 
about 110 objects including roughly 74 static objects and 
36 dynamic ones, it could be too costly to optimize for 
all of them. Hence, we seek to identify a small set of 
candidate objects whose optimization would lead to sig- 
nificant PLT reduction. 

In this paper, we use a simple greedy algorithm 
to search for those candidate objects. In the future, we 
could also leverage more sophisticated search algorithms 
(such as simulated annealing) to obtain better 
results. Our search algorithm considers one of the optimization 
methods (OPT_cdn, OPT_server, or OPT_client) 
at a time. It starts with a list of all the objects and 
the original object timing information extracted from the 
page load trace that corresponds to the median of the 
baseline PLT distribution. At each step, it greedily picks 
the candidate object whose optimization will lead to the 
largest PLT reduction among all the remaining objects. 
It then removes the new candidate object from the list 
and updates its timing information according to the opti- 
mization method. This process terminates when the PLT 
reduction resulting from the optimization of a new can- 
didate object becomes negligibly small. 
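
The greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: predict_plt stands in for the page load simulation, optimize stands in for one optimization method, and the min_gain threshold used to decide when a reduction is "negligibly small" is a hypothetical parameter.

```python
def greedy_candidates(objects, timing, optimize, predict_plt, min_gain=0.01):
    """Greedily pick objects whose optimization reduces the predicted PLT most.

    min_gain (relative PLT reduction) is an assumed stopping threshold.
    """
    timing = dict(timing)      # work on a copy of the baseline timing
    remaining = list(objects)
    chosen = []
    current = predict_plt(timing)
    while remaining:
        best_obj, best_plt = None, current
        for obj in remaining:
            trial = dict(timing)
            trial[obj] = optimize(trial[obj])
            plt = predict_plt(trial)
            if plt < best_plt:
                best_obj, best_plt = obj, plt
        # Stop when no object helps or the best gain is negligible.
        if best_obj is None or (current - best_plt) / current < min_gain:
            break
        timing[best_obj] = optimize(timing[best_obj])
        remaining.remove(best_obj)
        chosen.append(best_obj)
        current = best_plt
    return chosen, current


# Toy stand-in for the simulator: HTML first, then JS and image in parallel.
def toy_predict_plt(t):
    return t["html"] + max(t["js"], t["img"])


baseline_timing = {"html": 1.0, "js": 2.0, "img": 0.5}  # seconds, made up
chosen, final_plt = greedy_candidates(
    ["html", "js", "img"], baseline_timing, lambda d: d / 2, toy_predict_plt
)
print(chosen, final_plt)  # prints: ['js', 'html'] 1.5
```

In this toy run the image is never selected, because halving its delay does not shorten the critical path once the script and the HTML have been optimized.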

After evaluating 2,176 hypothetical scenarios, we 
identify 5 candidate objects each for OPT_server and OPT_cdn. 
We also identify 14 candidate objects for 
OPT_client. The predicted PLT reductions from applying 
OPT_cdn, OPT_client, and OPT_server are 14.8%, 26.6%, 
and 1.6% respectively. OPT_server does not 
seem promising, since it reduces PLT only 
slightly. Combining OPT_cdn and OPT_client raises the 
predicted PLT reduction to 40.1%. Therefore, 
by simply optimizing the client execution time of 14 ob- 
jects and moving 5 static objects to Akamai CDN, we 
predict that the median PLT of Yahoo Maps can be cut 
from 3.99 to 2.39 seconds. 


9 Systems Evaluation 


We now evaluate the systems overhead of dependency 
extraction and performance prediction. The dependency 
extraction process includes two steps: 1) subsequently 
control the download of each object during a page load; 
and 2) extract the PDG from the recorded timing information 
of object download (§3.2). Step 2 is relatively 
simple. The running time is dominated by step 1 because 
we need to reload a page many times and artificially de- 
lay the download of one object during each page load. 
Given a page with n objects and m stream objects, we 
need to reload the page n times to discover all the descendants 
and at most m × n times to discover all the 
stream parents and dependency offsets (§3.2). All the 
webpages we have studied so far have only a few stream 
(HTML) objects. Thus, the running time is roughly linear 
in the total number of objects in the page. Even for 
Yahoo Maps which has the most complex PDG, the run- 
ning time is only two hours. Note that since each con- 
trolled page load is independent, we can easily run de- 
pendency extraction on multiple machines in parallel to 
speed up the process. 

To predict the PLT of a page, we first need to parse 
page load traces to extract object timing information and 
then to simulate the page load process under new scenarios 
(§4). We evaluate the performance predictor on a commodity 
server with two 2.5 GHz Xeon processors and 16 
GB memory running Linux 2.6.18. We use Yahoo Maps 
as an example because its page incurs the largest pre- 
diction time among the four services. The trace parsing 
time depends on the size of the traces. For Yahoo Maps, 
it takes 317 seconds to parse 500 page load traces of one 
scenario with a total size of 455 MB. The page load sim- 
ulation time depends on the complexity of the PDG. For 
Yahoo Maps, it takes about 9 ms to run one page load 
simulation under our current implementation in Python. 
This translates to a total of about 20 seconds of simulation time to 
evaluate all of the 2,176 hypothetical scenarios in §8. We 
could further optimize the running time by rewriting the 
simulation code in C/C++. 
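
As a quick check, the total simulation time follows from the per-simulation cost and the number of scenarios reported above:

$$2{,}176 \times 9\ \text{ms} = 19{,}584\ \text{ms} \approx 19.6\ \text{s}$$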


10 Related Work 


There is a large body of prior work on web performance 
measurement and modeling. For instance, Smith et al. 
leveraged TCP/IP headers in packet traces to character- 
ize the nature of web traffic and the structure of web- 
pages [21]. Nahum et al. built an emulator to study the 
impact of network delay and packet loss on web server 
performance [17]. Olshefski et al. developed techniques 
for inferring client response time from server-side logs. 
These works either treat a webpage as a single object or 
treat each web object independently while ignoring the 
dependencies between different objects. 

Web performance measurement tools, such as Firebug 
and IBM Page Detailer [2], can provide detailed object 
timing information of a page load. Nonetheless, they can 
neither extract object dependencies nor perform PLT pre- 
dictions. 

CPRT [19] used client-side Javascript code to mea- 
sure user-perceived response times. AjaxScope [15] pro- 
vided more detailed Javascript code instrumentations to 
debug client-side errors. Due to the limited informa- 
tion exposed by the browser and OS to the Javascript 
layer, these approaches cannot reason about the impact 


of network-layer conditions, such as DNS delay or RTT, 
on web service performance. 

Several existing systems, e.g., Orion [11], eXpose [13], 
NetMedic [14], and Sherlock [9], employed 
various techniques to automatically infer causalities be- 
tween hosts, processes, and network flows. They lever- 
aged these causalities to diagnose performance problems 
in network applications. In contrast, WebProphet fo- 
cuses on extracting dependencies between web objects 
and predicting the performance of web applications. 

Wischik used a manually constructed dependency 
graph of Gmail to study the effects of TCP parameter 
settings on web performance [25]. In this paper, we for- 
mally define the object dependencies of a webpage in- 
cluding ancestors, stream and non-stream parents, and 
dependency offsets. We further develop an automated 
system to extract the PDG of a webpage. 


11 Conclusion 


We built WebProphet, a system that automates perfor- 
mance prediction for web services. The key idea of 
WebProphet is the use of PDG to encapsulate web ob- 
ject dependencies for accurate and scalable predictions. 
WebProphet leverages a novel technique based on tim- 
ing perturbation to extract object dependencies of com- 
plex webpages. It implements a simple and yet effec- 
tive model to simulate the page load process of a web 
browser, which enables accurate PLT prediction under 
changes to any web objects. It can also predict the statis- 
tical properties of a PLT distribution under a hypothetical 
scenario. Applying WebProphet to the Search and Maps 
services of Google and Yahoo, we successfully extract 
their PDGs and keep the PLT prediction error rates un- 
der 16% in most cases. Our results show WebProphet 
provides a solid foundation for web service providers to 
quickly find cost-effective optimization strategies for real 
applications. 
Abstract 


Mugshot is a system that captures every event in an ex- 
ecuting JavaScript program, allowing developers to de- 
terministically replay past executions of web applica- 
tions. Replay is useful for a variety of reasons: failure 
analysis using debugging tools, performance evaluation, 
and even usability analysis of a GUI. Because Mugshot 
can replay every execution step that led to a failure, it is 
far more useful for performing root-cause analysis than 
today’s commonly deployed client-based error reporting 
systems—core dumps and stack traces can only give de- 
velopers a snapshot of the system after a failure has oc- 
curred. 

Many logging systems require a specially instru- 
mented execution environment like a virtual machine or 
a custom program interpreter. In contrast, Mugshot’s 
client-side component is implemented entirely in stan- 
dard JavaScript, providing event capture on unmodified 
client browsers. Mugshot imposes low overhead in terms 
of storage (20-80KB/minute) and computation (slow- 
downs of about 7% for games with high event rates). 
This combination of features, a low-overhead library 
that runs in unmodified browsers, makes Mugshot one 
of the first capture systems that is practical to deploy to 
every client and run in the common case. With Mugshot, 
developers can collect widespread traces from programs 
in the field, gaining a visibility into application execution 
that is typically only available in a controlled develop- 
ment environment. 


1 Introduction 


Despite developers’ best efforts to release high qual- 
ity code, deployed software inevitably contains bugs. 
When failures are encountered in the field, many pro- 
grams record their state at the point of failure, e.g., in 
the form of a core dump, stack trace, or error log. That 
snapshot is then sent back to the developers for analysis. 


Perhaps the best known example is the Windows Error 
Reporting framework, which has collected over a billion 
error reports from user programs and the kernel [14]. 

Unfortunately, isolated snapshots only tell part of the 
story. The root cause of a bug is often difficult to deter- 
mine based solely on the program’s state after a problem 
was detected. Accurate diagnosis often hinges on an un- 
derstanding of the events that preceded the failure. For 
this reason, systems like Flight Data Recorder [29], De- 
jaVu [5], and liblog [13] have implemented deterministic 
program replay. These frameworks log enough informa- 
tion about a program’s execution to replay it later un- 
der the watchful eye of a debugging tool. With a few 
notable exceptions, these systems require a specially in- 
strumented execution environment like a custom kernel 
to capture a program’s execution. This makes them un- 
suitable for field deployment to unmodified end-user ma- 
chines. Furthermore, no existing capture and replay sys- 
tem is specifically targeted for the unique needs of the 
web environment. 

Mugshot’s goal is to provide low-overhead, “always-on” 
capture and replay for web-deployed JavaScript programs. 
Our key insight is that JavaScript provides 
sufficient reflection capabilities to log client-side nondeterminism 
in standard JavaScript running on unmodified 
browsers. As a user interacts with an application, 
Mugshot’s JavaScript library logs explicit user activity 
like mouse clicks and “behind-the-scenes” activity like 
the firing of timer callbacks and random number gener- 
ation. When the application fetches external objects like 
images, a server-side Mugshot proxy stores the binary 
data so that at replay-time, requests for the objects can 
access the log-time versions. 

The client-side Mugshot log is sent to the developer 
in response to a trigger like an unexpected exception 
being caught. Once the developer has the log, 
he uses Mugshot’s replay mode to recreate the original 
JavaScript execution on his unmodified browser. Like 
the logging library, the replay driver is implemented in 
standard JavaScript. The driver provides a “VCR” interface 
which allows the execution to be run in near-real 
time, paused, or advanced one event at a time. The in- 
ternal state of the replaying application can be inspected 
using unmodified JavaScript debugging tools like Fire- 
bug [17]. Developers can also analyze a script’s perfor- 
mance or evaluate the usability of a graphical interface 
by examining how real end-users interacted with it. 


At first glance, Mugshot’s logging and replay capabil- 
ities may seem to introduce a fundamentally new threat 
to user privacy. However, Mugshot does not create new 
techniques for logging previously untrackable events— 
instead, it leverages the preexisting introspective capa- 
bilities found in browsers’ standard JavaScript engines. 
Furthermore, the preexisting security policies which pre- 
vent cross-site data exchange also prevent Mugshot’s 
event logs from leaking across domains. Thus, the 
Mugshot log for a particular page can only be accessed 
by that page’s domain. 

The rest of the paper is organized as follows. In Sec- 
tion 2, we review related work in capture and replay. We 
then describe the architecture of Mugshot in Section 3. 
Section 4 contains our evaluation, which describes microbenchmarks 
(§4.1) and Mugshot’s performance inside 
several complex, real-world applications (§4.2). Our 
evaluation shows that Mugshot is unobtrusive at logging 
time and faithful at replay time, recreating several bugs 
that we found in our evaluation applications. We con- 
sider the privacy implications of Mugshot in Section 5 
and then conclude in Section 6. 


2 Related Work 


Error Reporting from the Field 


There are many frameworks for collecting information 
about crashed programs and sending the data back to the 
developers. Perhaps the best known is Windows Error 
Reporting (WER), which has been included in Windows 
since 1999 and has collected billions of error reports 
[14]. When a crash, hang, installation failure, or other 
error is detected, WER creates a minidump. Minidumps 
are snapshots of the system’s essential state: register val- 
ues, thread stacks, lists of loaded modules, a portion of 
the text segment surrounding the instruction pointer, and 
other information. With the user’s permission, this data 
is sent back to Microsoft, where it is bucketed according 
to likely root-cause for later analysis. WER only records 
the state of the application at the moment of the crash; 
developers must infer the sequence of events that led 
to it. Other deployed systems have similar capabilities 
and constraints, including Firefox’s Breakpad [4] and the 
iPhone OS [1]. 
Capture and Replay 


As described by the survey papers [8] and [10], de- 
terministic replay has been studied for many years in 
a variety of contexts. Some of the prior systems are 
machine-wide replay frameworks intended to debug op- 
erating systems or cache coherence in distributed shared 
memory systems [11, 22, 29]. They create high-fidelity 
execution logs, but with significant cost: the target soft- 
ware must run atop custom hardware, a modified kernel, 
or a special virtual machine. This limits opportunities 
for widespread deployment. Furthermore, these systems 
produce log data at a high rate; generally, these systems 
can only report the last few seconds of system state be- 
fore an error. 


Moving up the stack, a number of application- or 
language-specific tools allow deterministic capture and 
replay. By restricting themselves to high-level interfaces 
and a single-threaded execution model, they obviate the 
need to log data at the instruction level. This dramati- 
cally reduces the overhead of logging, both in processor 
time and storage requirements. The liblog system [13] 
is perhaps closest to our work. liblog provides a C- 
library interposition layer that records the input and out- 
put of all interactions between the application and the C 
library (and hence, the operating system) below. As with 
Mugshot, one of liblog’s goals was to make logging suffi- 
ciently lightweight that it can be run in the common case. 
Other application-specific logging environments include 
DejaVu [5] for Java programs and Retrospect [3] for par- 
allel programs written using the MPI interface. 


Ripley [20] is a framework for preserving the compu- 
tational integrity of AJAX applications. Client and server 
code is written in .NET, but Ripley automatically translates 
the client-side portion into JavaScript for execution 
in a browser. The instrumented JavaScript sends an event 
stream to the web server, which replays the events to a 
server-side replica of the client. The server only executes 
a client RPC if it is also generated by the replica. 


Mugshot differs from Ripley in three important ways. 
First, Mugshot works on arbitrary JavaScript applica- 
tions and does not require applications to be developed 
in a special environment. Second, Ripley’s current im- 
plementation does not capture all sources of nondeter- 
minism. For example, Ripley does not handle calls to 
Date(). It could treat Date() as an RPC and have 
the client synchronously fetch a value from the server (a 
value which would also be fed to the server-side client 
replica). However, this incurs a round trip for each time 
request, making it infeasible for applications like games 
that rapidly generate events. Third, for performance rea- 
sons, Ripley’s server-side client replicas are not actual 
web browsers—they are lightweight browser emulators 
that track DOM state (§3.1.3) but do not perform layout 
or rendering. In contrast, Mugshot replays events inside 
the same browser type used at logging time. This greatly 
increases the likelihood that a buggy execution can be 
recreated. 


Debugging for Web Applications 


There are a variety of tools for debugging web appli- 
cations. For example, Fiddler [21] is a web proxy that al- 
lows the local user to inspect, modify, and replay HTTP 
messages. Firebug [17], an extension to Firefox, is an 
advanced JavaScript debugger that supports breakpoints, 
arbitrary expression evaluation, and performance profil- 
ing. Internet Explorer 8 has a built-in debugger with sim- 
ilar features. All of these tools provide rich introspection 
upon the local execution environment. However, none of 
them provide a way to capture remote bugs in the wild 
and explore the execution paths that led to faulty behavior. 

AjaxScope [18] uses a web proxy to dynamically 
instrument JavaScript code before sending it to re- 
mote clients. Developers express their debugging intent 
through functions inserted at specific places in the code’s 
abstract syntax tree. For example, to check for infinite 
loops, a programmer can attach a diagnostic function to 
each for, while, and do-while statement. Whereas 
AjaxScope’s goal is to let developers express specific debugging 
policies, Mugshot focuses on recreating entire 
remote execution contexts. 

The commercial Selenium [27] Firefox extension 
records user activity for later playback. Recording can 
only be done in Firefox, but playback is portable across 
browsers using synthetic JavaScript events. Because 
Selenium does not log the full set of nondeterministic 
events, it is suitable for automating tests, but it cannot 
reproduce many nondeterministic bugs. 

The commercial products ClickTale [7] and CS Ses- 
sionReplay [12] capture mouse and keyboard events in 
browser-based applications. However, neither of these 
products exposes a full, browser-neutral environment for 
logging all sources of browser nondeterminism, includ- 
ing both client-side nondeterminism like timer inter- 
rupts and server-side nondeterminism like dynamic im- 
age generation. The services provide click analytics and 
a movie of client-visible interactions, but not the underly- 
ing internal state of the JavaScript heap and the browser 
DOM tree. 


3 Design and Implementation 


Mugshot’s goal is to record the execution of a web ap- 
plication on an end user’s machine, then recreate that ex- 
ecution on a developer’s machine. To capture applica- 
tion activity, one could exhaustively record every inter- 
mediate state of the program. Mugshot instead takes the 


approach of many other systems: recording all sources 
of nondeterminism. If an application is run again and 
injected with the same nondeterministic events, the pro- 
gram will follow the same execution path that was ob- 
served at logging time. 

Past systems have recorded nondeterminism at the instruction 
level [11] or the libc level [13]. However, 
the former may introduce prohibitive logging overheads, 
and both require users to modify standard browsers or 
operating systems. Both approaches also record nonde- 
terminism at a granularity that is unnecessarily fine for 
JavaScript-driven web applications. JavaScript programs 
are non-preemptively single threaded and event driven. 
Applications register callback functions for events like 
key strokes or the completion of an asynchronous HTTP 
request. When the browser detects such an event, it in- 
vokes the appropriate application handlers. The browser 
will never interrupt the execution of one handler to run 
another. Thus, the execution path of the application is 
solely determined by the event interleavings encountered 
during a particular run. This means that logging the con- 
tent and the ordering of events provides sufficient infor- 
mation for replay. 
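
The idea that logging the content and order of events suffices for replay can be sketched in a few lines. The example below is a conceptual analogy written in Python, not Mugshot's JavaScript implementation: the App class, its handlers, and the log format are all assumptions made for illustration. A single-threaded application runs each handler to completion, so feeding it the same events and the same random values in the same order reproduces its state.

```python
import random


class App:
    """Toy single-threaded app; each handler runs to completion."""

    def __init__(self, rand):
        self.rand = rand   # injectable source of randomness
        self.state = []

    def handle(self, name, payload):
        if name == "click":
            self.state.append((payload, round(self.rand() * 100)))
        elif name == "timer":
            self.state.append(("tick", len(self.state)))


def record(events, seed=None):
    """Run the app while logging every event and every random value."""
    log = []
    rng = random.Random(seed)

    def logged_rand():
        value = rng.random()
        log.append(("random", value))
        return value

    app = App(logged_rand)
    for name, payload in events:
        log.append(("event", name, payload))
        app.handle(name, payload)
    return app.state, log


def replay(log):
    """Re-run the app, feeding it events and random values from the log."""
    entries = iter(log)

    def replayed_rand():
        entry = next(entries, ("end",))
        if entry[0] != "random":
            raise RuntimeError("log diverged from execution")
        return entry[1]

    app = App(replayed_rand)
    for entry in entries:
        if entry[0] != "event":
            raise RuntimeError("log diverged from execution")
        app.handle(entry[1], entry[2])
    return app.state


original_state, event_log = record(
    [("click", "a"), ("timer", None), ("click", "b")]
)
replayed_state = replay(event_log)
print(replayed_state == original_state)  # prints: True
```

The replayed run never consults the random number generator; it reads the logged values instead, which is why the two states match regardless of the seed.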

Logging nondeterminism at the level of JavaScript 
events would be easy if we could insert logging code 
directly into the browser. However, this solution is un- 
appealing to developers since it requires users to down- 
load a special browser or install a logging plug-in. Many 
users will not opt into such a scheme, dramatically re- 
ducing the size and diversity of the developer’s logging 
demographic. 

To avoid these problems, we implemented the client 
portion of Mugshot entirely in JavaScript. Compared 
to an in-browser solution, a JavaScript implementation 
is more complex and more difficult to make performant. 
However, it has the enormous advantage of being trans- 
parent to users and hence much easier to deploy. As we 
will see in the sections that follow, JavaScript offers suf- 
ficient introspection and self-modification capabilities to 
enable insertion of shims that log most sources of non- 
determinism. 

In Section 3.1, we enumerate the sources of nondeter- 
minism in web applications and describe how Mugshot 
captures them in Firefox and IE. Although conceptu- 
ally straightforward, the logging process is complicated 
by various browser incompatibilities and implementation 
deficiencies, particularly with respect to keyboard events. 
In Section 3.2, we describe how Mugshot replays an ex- 
ecution by dispatching synthetic events from its log. 


3.1 Capturing Nondeterministic Events 


To add Mugshot recording to an application, the
developer delivers an application through a server-side
web proxy. The proxy’s first job is to insert a single tag
at the beginning of the application’s <head> block:
<script src='Mugshot.js'></script>

When the page loads, the Mugshot library runs before 
the rest of the application code has a chance to execute. 
Mugshot interposes on the sources of nondeterminism 
that we describe below and begins to write to an in- 
memory log. Event recording continues until the page is 
closed. 

If the application contains multiple frames, the proxy 
injects the Mugshot <script> tag into each frame. 
Child frames report all events to the Mugshot library run- 
ning in the topmost frame; this frame is responsible for 
collating the aggregate event log and sending it back to 
the developer. 

The developer controls when the application uploads 
event logs. For example, the application may post logs at 
predefined intervals, or only if an unexpected exception 
is thrown. Alternatively, the developer may add an explicit
“Send error report” button to the application which
triggers a log post.

Figure 1 lists the various sources of nondeterminism 
in web applications. In the sections below, we discuss 
each of the broad categories and describe how we cap- 
ture them on Firefox and IE. Our discussion proceeds 
in ascending order of the technical difficulty of logging 
each event category. 

Importantly, Mugshot does not log events for media 
objects that are opaque to JavaScript code. For exam- 
ple, Mugshot does not record when users pause a Flash 
movie or click a region inside a Java applet. Since these 
objects do not expose a JavaScript-accessible event inter- 
face, Mugshot can make no claims about their state. The 
current implementation of Mugshot also does not capture 
nondeterministic events arriving from opaque containers 
like Flash’s ExternalInterface; such events are 
rarely used in practice. 

For each new event that it does capture, Mugshot cre- 
ates a log entry containing a sequence number and the 
wall clock time. The entry also contains the event type 
and enough type-specific data to recreate the event at re- 
play time. For example, for keyboard events, Mugshot 
records the GUI element that received the event, the 
character code for the relevant key, and whether any of 
the shift, alt, control, or meta keys were simultaneously 
pressed. 


3.1.1 Nondeterministic Function Calls 


Applications call new Date() to get the current
time and Math.random() to get a random number. To
log time queries, Mugshot wraps the original constructor
for the Date object with one that logs the returned
time. To log random number generation, Mugshot replaces
the built-in Math.random() with a simple linear
congruential generator [23]. Mugshot uses the application’s
load date to seed the generator, and it writes this
seed to the log. Given this seed, subsequent calls to the
random number generator are deterministic and do not
require subsequent log entries.

Our initial implementation of Mugshot did not define
a custom random number generator; instead, it simply
wrapped Math.random() in the same way that it
wrapped Date(). However, we found that games often
made frequent requests for random numbers, e.g., to
determine whether a space invader should move up or
down. The resulting logs were filled with random numbers
and did not compress well. This was a problem because
uncompressed logs can be large (§4.2.1). Thus,
we decided to use the logging scheme described above.
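
The seed-only logging scheme can be sketched as follows. This is a minimal illustration, not Mugshot's actual code: the function name makeSeededRandom, the log entry format, and the generator constants (the commonly cited Numerical Recipes parameters) are all assumptions, since the text does not specify them.

```javascript
// Sketch of a seeded linear congruential generator (LCG) standing in for
// Math.random(). Constants are assumed; the paper does not name them.
function makeSeededRandom(seed) {
  const a = 1664525;
  const c = 1013904223;
  const m = 4294967296; // 2^32
  let state = seed % m;
  return function random() {
    // a * state stays below 2^53, so this arithmetic is exact in doubles.
    state = (a * state + c) % m;
    return state / m; // value in [0, 1)
  };
}

// Logging side: only the seed is written to the (hypothetical) log.
const randomLog = [];
const loadSeed = Date.now() % 4294967296;
randomLog.push({ type: 'randomSeed', value: loadSeed });
const loggedRandom = makeSeededRandom(loadSeed);
const recorded = [loggedRandom(), loggedRandom(), loggedRandom()];

// Replay side: rebuilding the generator from the logged seed yields the
// same sequence without any per-call log entries.
const replayRandom = makeSeededRandom(randomLog[0].value);
const replayed = [replayRandom(), replayRandom(), replayRandom()];
```

Because every later value is a pure function of the seed, the log grows by one entry per page load rather than one entry per call.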


3.1.2 Interrupts 


JavaScript interrupts allow applications to schedule
callbacks for later invocation. A callback can be
scheduled for one-time execution using
setTimeout(callback, waitTime), or for periodic
execution using setInterval(callback, period).
JavaScript is cooperatively single threaded, so interrupt
callbacks (and event handlers in general) execute
atomically and do not preempt each other.

Mugshot logs interrupts by wrapping the standard
versions of setTimeout() and setInterval().
The wrapped registration functions take an application- 
provided callback, wrap it with logging code, and regis- 
ter the wrapped callback with the native interrupt sched- 
uler. Mugshot also assigns the callback a unique id; since 
JavaScript functions are first class objects, Mugshot 
stores this id as a property of the callback object. Later, 
when the browser invokes the callback, the wrapper code 
logs the fact that a callback with that id executed at the 
current wall clock time. 
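
A minimal sketch of this wrapping step is shown below. The helper name wrapScheduler, the mugshotId property name, and the log entry shape are hypothetical; the native scheduler is passed in as a parameter so that the same wrapper can cover both setTimeout() and setInterval().

```javascript
// Wrap a native scheduling function so that every callback invocation is
// logged together with the id of the callback that ran.
function wrapScheduler(nativeSchedule, log, kind) {
  let nextCallbackId = 0;
  return function loggedSchedule(callback, delay) {
    // Functions are first-class objects, so the id can live on the callback.
    if (callback.mugshotId === undefined) {
      callback.mugshotId = nextCallbackId++;
    }
    const wrapped = function () {
      log.push({ type: kind, callbackId: callback.mugshotId, time: Date.now() });
      return callback.apply(this, arguments);
    };
    // The native function's return value is the cancellation identifier.
    return nativeSchedule(wrapped, delay);
  };
}

// Example wiring (nothing is scheduled until loggedSetTimeout is called).
const timerLog = [];
const loggedSetTimeout = wrapScheduler(setTimeout, timerLog, 'timeout');
```

Returning the native identifier unchanged keeps clearTimeout() and clearInterval() working for the application without further interposition.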

Although this interposition is simple in concept, IE does
not support it in a straightforward way. Mugshot’s modified
setTimeout() must hold a reference to the browser’s
original setTimeout() function; however, IE sometimes
garbage collects this reference, leading to a
“function not defined” error from the Mugshot wrapper.
To mitigate this problem, Mugshot creates an invisible
<iframe> tag, which comes with its own namespace
and hence its own references to setTimeout() and
setInterval(). The Mugshot wrapper invokes
copies of these references when it needs to schedule a
wrapped application callback.

Although this trick gives Mugshot references to the
native scheduling functions, it prevents Mugshot from
actually scheduling callbacks until the hidden frame
is loaded. This problem has three cascading consequences.
First, since JavaScript is single-threaded,
Mugshot cannot block until the hidden frame is loaded
without hanging the application. Instead, it must queue
application timer requests and install them once the
hidden frame loads. Second, setTimeout() and
setInterval() return opaque scheduling identifiers
that the application can use to cancel the callback via
clearTimeout() and clearInterval(). For
interrupt registrations issued before the hidden frame
loads, Mugshot cannot call the real registration functions
to get cancellation ids. So, Mugshot generates synthetic
identifiers and maintains a map to the real identifiers it
acquires later. Third, an application may cancel an interrupt
before the hidden frame loads; Mugshot responds by
simply removing the callback from its queue of requests.

Figure 1: Sources of nondeterminism in browsers.

AJAX requests allow JavaScript applications to issue
asynchronous web fetches. The browser represents
each request as an XMLHttpRequest object. To
receive notifications about the status of the request,
applications assign a callback function to the object’s
onreadystatechange property. The browser invokes
this function whenever new data arrives or the entire
transmission is complete. Upon success or failure,
the various properties of the object contain the status
code for the transfer (e.g., 200 OK) and the fetched data.

Mugshot must employ different techniques to wrap
AJAX callbacks on different browsers. On Firefox,
Mugshot’s wrapped XMLHttpRequest constructor
registers a DOM Level 2 handler (§3.1.3) for the object’s
onreadystatechange event. IE does not support
DOM Level 2 handlers on AJAX objects, so Mugshot interposes
on the object’s send method to wrap the application
handler in logging code before the browser issues
the request. We describe DOM Level 2 handlers in more
detail in the next section.

For each AJAX event, Mugshot logs the current state
of the request (e.g., “waiting for data”), the HTTP headers,
and any incremental data that has already returned.
Once the request has completed, Mugshot logs the HTTP
status codes. We also log the raw request data on the
server-side replay proxy (§3.2.1). For the purposes of replay,
this data only needs to be logged on one side. Thus,
our current implementation consumes more space than
strictly necessary. However, it makes the client-side and
proxy-side logs more understandable to human debuggers,
since AJAX activity in one log does not have to be
collated with data from the other log.
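
The send-interposition approach used on IE can be sketched as follows. This is an illustrative approximation only: interposeOnSend and FakeRequest are hypothetical names, FakeRequest is a synchronous stand-in for XMLHttpRequest so that the sketch runs outside a browser, and the log entry fields are assumptions.

```javascript
// Interpose on send() so that whatever handler the application assigned to
// onreadystatechange is wrapped in logging code before the request is issued.
function interposeOnSend(RequestClass, log) {
  const nativeSend = RequestClass.prototype.send;
  RequestClass.prototype.send = function (body) {
    const request = this;
    const appHandler = request.onreadystatechange;
    request.onreadystatechange = function () {
      const entry = { type: 'ajax', readyState: request.readyState };
      if (request.readyState === 4) {
        entry.status = request.status; // status code once the request completes
      }
      log.push(entry);
      if (typeof appHandler === 'function') {
        return appHandler.apply(request, arguments);
      }
    };
    return nativeSend.call(request, body);
  };
}

// Stand-in for XMLHttpRequest (assumption: state changes fire synchronously).
class FakeRequest {
  constructor() {
    this.readyState = 0;
    this.status = 0;
    this.onreadystatechange = null;
  }
  open() { this.readyState = 1; }
  send() {
    for (const state of [3, 4]) { // "data arriving", then "complete"
      this.readyState = state;
      if (state === 4) this.status = 200;
      if (this.onreadystatechange) this.onreadystatechange();
    }
  }
}
```

Wrapping at send() time rather than at assignment time means the application can set or replace its handler freely before issuing the request.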


3.1.3 DOM Events 


The Document Object Model (or DOM) is the inter- 
face between JavaScript applications and the browser’s 
user interface [28]. Using DOM calls, JavaScript ap- 
plications register handlers for user events like mouse 
clicks. DOM methods also allow the application to dy- 
namically modify page content and layout. 

The browser binds every element in a page’s HTML to 
an application-accessible JavaScript object. Applications 
attach event handlers to these DOM objects, informing 
the browser of the application code that should run when 
a DOM node generates a particular event. In the simplest 
handler registration scheme, applications simply assign 
functions to specially-named DOM node properties. For 
example, to execute code whenever the user clicks on 
a <div> element, the application assigns a function to 
the onclick property of the corresponding JavaScript 
DOM node. 

This simple model, called DOM Level 0 registration,
only allows a single handler to be assigned to each DOM
node/event pair. Modern browsers also implement the
DOM 2 model, which allows an application to register
multiple handlers for a particular DOM node/event pair.
An application calls the node’s attachEvent() (IE)
or addEventListener() (Firefox) method, passing
an event name like “click”, a callback, and in Firefox, a
useCapture flag to be discussed shortly.

<div onclick='handlerA()'>
  <div onclick='handlerB()'>
    <div onclick='handlerC()'>
    </div>
  </div>
</div>

Figure 2: Event handling after a user clicks within Div
C. In the W3C model’s capturing phase, handlerA() is
called if it is a capturing handler, followed by handlerB()
if it is a capturing handler. After the target’s handlerC()
is called, W3C mandates a bubbling phase in which handlers
marked as bubbling are called from the inside out.


The World Wide Web Consortium’s DOM Level 2 
specification [28] defines a three-phase dispatch process 
for each event (Figure 2). In the capturing phase, the 
browser hands the event to the special window and 
document JavaScript objects. The event then traces 
a path down the DOM tree, starting at the top-level 
<html> DOM node and eventually reaching the DOM 
node that actually generated the event. The boolean pa- 
rameter in addEventListener() allows a handler 
to be specified as capturing. Capturing handlers are only 
executed in the capturing phase; they allow a DOM node 
to execute code when a child element has generated an 
event. Importantly, the ancestor’s handler will be called 
before any handlers on the child run. 


In the target phase, the event is handed to the DOM 
node that generated it. The browser executes the appro- 
priate handlers at the target, and then sends the event 
along the reverse capturing path. In this final bubbling 
phase, ancestors of the target can run event handlers 
marked as bubbling, allowing them to process the event 
after it has been handled by descendant nodes. 


In the DOM 2 model, some event types are cancelable,
i.e., an event handler can prevent the event from continuing
through the three-phase process. Also, although
all events capture, some do not bubble. For example,
load events, which are triggered when an image has
completely downloaded, do not bubble. Form events also
do not bubble. Examples of form events include focus
and blur, which are triggered when a GUI element like
a text box gains or loses input focus.
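
The three-phase dispatch process can be illustrated with a small simulation. This is a simplified model rather than a browser implementation: SimNode and dispatch are hypothetical names, all handlers on the target run in the target phase regardless of their capture flag, and cancellation is modeled only as stopping propagation between nodes.

```javascript
// Toy DOM node that stores DOM 2 style listeners.
class SimNode {
  constructor(name, parent = null) {
    this.name = name;
    this.parent = parent;
    this.listeners = [];
  }
  addEventListener(type, handler, useCapture = false) {
    this.listeners.push({ type, handler, useCapture });
  }
  runHandlers(event, accepts) {
    for (const l of this.listeners.slice()) {
      if (l.type === event.type && accepts(l)) l.handler.call(this, event);
    }
  }
}

function dispatch(target, type, bubbles = true) {
  const event = {
    type, target, stopped: false,
    stopPropagation() { this.stopped = true; },
  };
  const ancestors = []; // ordered root first
  for (let n = target.parent; n !== null; n = n.parent) ancestors.unshift(n);

  for (const node of ancestors) {                    // 1. capturing phase
    node.runHandlers(event, l => l.useCapture);
    if (event.stopped) return event;
  }
  target.runHandlers(event, () => true);             // 2. target phase
  if (event.stopped || !bubbles) return event;
  for (const node of ancestors.slice().reverse()) {  // 3. bubbling phase
    node.runHandlers(event, l => !l.useCapture);
    if (event.stopped) return event;
  }
  return event;
}

// The nesting from Figure 2: A contains B contains C.
const divA = new SimNode('A');
const divB = new SimNode('B', divA);
const divC = new SimNode('C', divB);
const order = [];
divA.addEventListener('click', () => order.push('A capture'), true);
divB.addEventListener('click', () => order.push('B bubble'), false);
divC.addEventListener('click', () => order.push('C target'));
dispatch(divC, 'click');
// order is now ['A capture', 'C target', 'B bubble']
```

A capturing handler on the outermost node always runs first, even if a handler closer to the target later stops the event.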


3.1.4 DOM Events and Firefox 


Firefox supports the W3C model for DOM events. 
Thus, Mugshot can record these events in a straightfor- 
ward way—it simply attaches capturing logging handlers 
to the window object. Since the window object is the 
highest ancestor in the DOM event hierarchy, Mugshot’s 
logging code is guaranteed to catch every event before 
it has an opportunity to be canceled by other nodes in 
the capture, target, or bubble phases. Note that canceled 
events still need to be logged, since they caused at least 
one application handler to run! 

Mugshot must ensure that the application does not
accidentally delete or overwrite Mugshot’s logging
handlers. To accomplish this, Mugshot registers the
logging handlers as DOM 2 callbacks, exploiting the fact
that applications cannot iterate over the DOM 2 handlers
for a node, and they cannot deregister a DOM 2 handler
via domNode.detachEvent(eventName,
callback) without knowing the callback’s function
pointer.

Mugshot must also ensure that its DOM 2 window 
handlers run before any application-installed window 
handlers execute and potentially cancel an event. Firefox 
invokes a node’s DOM 2 callbacks in the order that they 
were registered; since the Mugshot library registers its 
handlers before any application code has run, its logging 
callbacks are guaranteed to run before any application- 
provided DOM 2 window handlers. 

Unfortunately, Firefox invokes any DOM 0 handler 
on the node before invoking the DOM 2 handlers. To 
ensure that Mugshot’s DOM 2 handler runs before any 
application-provided DOM 0 callback, Mugshot uses 
JavaScript setters and getters to interpose on assignments 
to DOM 0 event properties. Setters and getters define 
code that is bound to a particular property on a JavaScript 
object. The getter is invoked on read accesses to the property,
and the setter is invoked on writes.

Ideally, Mugshot would define a DOM 2 logging handler
for each event type e, and create setter code for
the window.e property which wrapped the user-specified
handler with a Mugshot-provided logging function.
If the application provided no DOM 0 handler,
Mugshot’s DOM 2 callback would log the event; otherwise,
the wrapped DOM 0 handler would log the event
and set a special flag on the event object indicating that
Mugshot’s DOM 2 handler should not duplicate the log
entry. Unfortunately, this scheme will not work because
Firefox’s getter/setter implementation is buggy. Mugshot
can create a getter/setter pair for a DOM node event property,
and application writes to the property will properly
invoke the setter. However, when an actual event of type
e is generated, the browser will not invoke the associated
function. In other words, the setter code, which works
perfectly at the application level, hides the event handler
from the internal browser code.

Luckily, the setter code does not prevent the browser 
from invoking DOM 2 handlers. Thus, Mugshot’s setter 
also registers the application-provided handler as a DOM 
2 callback. The setter code ensures that when the application
overwrites the DOM 0 property name, Mugshot
deregisters the shadow DOM 2 version of the old DOM
0 handler.
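
The shadow-handler workaround can be sketched with a standard accessor property. The code below is an approximation under stated assumptions: FakeNode is a hypothetical stand-in whose fire() method models a browser that only sees DOM 2 handlers, and interposeDom0 is an invented helper name.

```javascript
// Stand-in node: the "browser" dispatches only to DOM 2 handlers, mirroring
// the behavior described above where the setter hides the DOM 0 handler.
class FakeNode {
  constructor() { this.dom2 = []; }
  addEventListener(type, fn) { this.dom2.push({ type, fn }); }
  removeEventListener(type, fn) {
    this.dom2 = this.dom2.filter(l => !(l.type === type && l.fn === fn));
  }
  fire(type) {
    for (const l of this.dom2.slice()) {
      if (l.type === type) l.fn({ type });
    }
  }
}

function interposeDom0(node, type, log) {
  let appHandler = null;
  // The logging handler is registered first, so it runs first.
  node.addEventListener(type, e => log.push('logged ' + e.type));
  Object.defineProperty(node, 'on' + type, {
    configurable: true,
    get() { return appHandler; },
    set(fn) {
      // Drop the shadow DOM 2 copy of the previous DOM 0 handler, if any.
      if (typeof appHandler === 'function') node.removeEventListener(type, appHandler);
      appHandler = fn;
      // Register the new handler as a shadow DOM 2 callback.
      if (typeof fn === 'function') node.addEventListener(type, fn);
    },
  });
}
```

From the application's point of view, reads and writes of the DOM 0 property behave as usual; only the registration path underneath changes.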

When Mugshot logs a DOM event, it records an iden- 
tifier for the DOM node target. If the target has an HTML 
id, e.g., <div id='foo'>, Mugshot tags the log entry
with that id. Otherwise, it identifies the target by
specifying the capturing path from the root <html> tag.
For example, the id (1,5) specifies that the target can 
be reached by following the first child of the <html> 
tag and then the fifth child of that node. Since many 
JavaScript applications use dynamic HTML, the path for 
a particular node may change throughout a program’s ex- 
ecution. Thus, the determination of a target’s path id 
must be done at the time the event is seen—it cannot be 
deferred until (say) the time that the log is posted to the 
developer. 
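
Target identification along these lines can be sketched over a plain tree. The helper names (makeNode, identifyTarget, resolvePath) and the node shape are hypothetical; child positions are 1-based to match the (1,5) example above.

```javascript
// Minimal tree node: an optional HTML id, a parent link, and ordered children.
function makeNode(parent = null, id = null) {
  const node = { id, parent, children: [] };
  if (parent) parent.children.push(node);
  return node;
}

// Compute the identifier at the moment the event is seen, since later DOM
// changes could alter the path.
function identifyTarget(node) {
  if (node.id) return { id: node.id };
  const path = [];
  for (let n = node; n.parent !== null; n = n.parent) {
    path.unshift(n.parent.children.indexOf(n) + 1); // 1-based child position
  }
  return { path };
}

// At replay time, walk the recorded path down from the root.
function resolvePath(root, path) {
  let node = root;
  for (const index of path) node = node.children[index - 1];
  return node;
}
```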


3.1.5 DOM Events and IE 


IE’s event model is only partially compatible with 
the W3C one. The most important difference is that 
IE does not support the capturing phase of event prop- 
agation. This introduces two complications. First, an 
event may never have an opportunity to bubble up to 
a window-level logging handler—the event might be 
canceled by a lower-level handler, or it may be a non- 
bubbling event like a load. Second, even if an event bub- 
bles up to a window-level logger, the event may have 
triggered lower-level event handlers and generated log- 
gable nondeterministic events. For example, a mouse 
click may trigger a target-level callback that invokes new 
Date(). The mouse click is temporally and causally 
antecedent to the time query. However, the mouse click 
would be logged after the time query, since the time 
query is logged at its actual generation time, whereas 
the mouse click is logged after it has bubbled up to the 
window-level handler. 

Mugshot addresses these problems using several tech- 
niques. To log non-bubbling events, Mugshot ex- 
ploits IE’s facility for extending the object prototypes 
for DOM nodes. For DOM types like Images and 
Inputs which support non-bubbling events, Mugshot 
modifies their class definitions to define custom setters
for DOM 0 event properties. Mugshot also redefines
attachEvent() and detachEvent(), the mechanisms
by which applications register DOM 2 handlers
for these nodes. The DOM 0 setters and the wrapped 
DOM 2 registration methods collaborate to ensure that 
if an application defines at least one handler for a DOM 
node/event pair, Mugshot will log relevant events pre- 
cisely once, and before any application-specified handler 
can cancel the event. 

Ideally, Mugshot could use the same techniques to 
capture bubbling events at the target phase. Unfortu- 
nately, IE’s DOM extension facility is fragile: redefining 
certain combinations of DOM 0 properties can cause un- 
predictable behavior. Therefore, Mugshot uses window- 
level handlers to log bubbling events; this is the problem- 
atic technique described above that may lead to temporal 
violations in the log. Fortunately, Mugshot can mitigate 
this problem in IE, because IE stores the current DOM 
event in a global variable window.event. Whenever 
Mugshot needs to log a source of nondeterminism, it first 
checks whether window.event is defined and refers 
to a not-yet-logged event. If so, Mugshot logs the event 
before examining the causally dependent event. 
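
The ordering fix can be sketched as a logger that consults the current DOM event before recording any other nondeterminism. This is an illustration only: makeCausalLogger and the log entry shapes are hypothetical, getCurrentEvent stands in for reading window.event, and the sketch assumes the same event object is returned for the duration of one dispatch.

```javascript
function makeCausalLogger(getCurrentEvent) {
  const log = [];
  // Track logged events externally instead of marking the event object.
  const alreadyLogged = new WeakSet();

  function logDomEvent(event) {
    if (alreadyLogged.has(event)) return; // log each DOM event once
    alreadyLogged.add(event);
    log.push({ type: 'dom', name: event.type });
  }

  function logNondeterminism(entry) {
    // If a DOM event is in flight and not yet logged, log it first so the
    // log preserves causal order.
    const current = getCurrentEvent();
    if (current) logDomEvent(current);
    log.push(entry);
  }

  return { log, logDomEvent, logNondeterminism };
}
```

With this check in place, a time query made inside a click handler is preceded in the log by the click that caused it, even though the window-level handler sees the click later.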

An application may cancel a bubbling event before 
it reaches Mugshot’s window-level handler by setting 
its Event.cancelBubble property to true. The 
event still must be logged, so Mugshot extends the 
class prototype for the Event object, overriding its 
cancelBubble setter to log the event before its can- 
cellation. 

In summary, Mugshot on IE logs all bubbling DOM 
events, but only the non-bubbling events for which the 
application has installed handlers. This differs from 
Mugshot’s behavior on Firefox, where it logs all DOM 
events regardless of whether the application cares about 
them. Recording these “spurious” events does not af- 
fect correctness at replay time, but it does increase 
log size. Fortunately, as we show in Section 4.2.1, 
Mugshot’s compressed logs are small enough that the 
storage penalty for spurious events is small. Thus, we 
were not motivated to implement an IE-style logging so- 
lution for Firefox—it was technically feasible, but com- 
paratively much more difficult to implement correctly 
than our capturing-handler solution. 


3.1.6 Handling Load Events on IE 


In IE, load events do not capture or bubble. Using the
techniques described in the previous section, Mugshot 
can capture these events for elements with application- 
installed load handlers. However, Mugshot actually 
needs to capture all load events so that at replay time, 
it can render images in the proper order and ensure that
the page layout unfolds in the same fashion observed at
logging time. Otherwise, an application that introspects 
the document layout may see different intermediate re- 
sults at replay time. 

Ideally, Mugshot could modify the prototype for 
Image objects such that whenever the browser created 
an Image node, the node would automatically install a 
logging handler for load events. Unfortunately, prototype
extension only works for properties and methods
accessed by application-level code; the browser’s
native code creation of the DOM node cannot be mod- 
ified by extending the JavaScript-level prototype. So, 
Mugshot uses a hack: whenever it logs an event, it sched- 
ules a timeout to check whether that event has created 
new Image nodes; if so, Mugshot explicitly adds a 
DOM 2 logging handler which records load events for 
the image. Mugshot specifies this callback by invoking
a non-logged setTimeout(imageCheck, 0). The
0 value for the timeout period makes the browser invoke 
the imageCheck callback “as soon as possible.” Since 
the timeout is set from the context of an event dispatch, 
the browser will invoke the callback immediately after 
the dispatch has finished, but before the dispatch of other 
queued events (such as the load of an image that we 
want to log). Mugshot also performs this image check at 
the end of the initial page parse to catch the loading of 
the page’s initial set of images. 


3.1.7 Synthetic Events 


Applications call DOMnode.fireEvent() on IE 
and DOMnode.dispatchEvent() on Firefox to
generate synthetic events. Mugshot uses these func- 
tions at replay time to simulate DOM activity from the 
log. However, the application being logged can also call 
these functions. These synthetic events are handled syn- 
chronously by the browser; thus, from Mugshot’s per- 
spective, they are deterministic program outputs which 
do not need to be logged. However, in terms of the event 
dispatching path, the browser treats the fake events just 
like real ones, so they will be delivered to Mugshot’s log- 
ging handlers. 


To prevent these events from getting
logged on Firefox, Mugshot interposes on
document.createEvent(), which applications
must call to create the fake event that will be
passed to dispatchEvent(). The interposed
document.createEvent() assigns a special
doNotLog property to the event before returning it to
the application. Mugshot’s logging code will ignore
events that define this property.

This technique does not work on IE, which prohibits
the addition of new properties to the Event object.
Thus, Mugshot uses prototype extension to interpose
on fireEvent(). Inside the interposed version,
Mugshot pushes an item onto a stack before calling the
native fireEvent(). After the call returns, Mugshot
pops an item from the stack. In this fashion, if Mugshot’s
logging code for DOM events notices a non-empty stack,
it knows that the current DOM event is a synthetic one
and should not be logged.
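
The stack-based filter can be sketched as follows. FakeIeNode is a hypothetical stand-in for an IE DOM node whose fireEvent() delivers events synchronously, and interposeOnFireEvent and makeDomLogger are invented names; the try/finally is an addition here so that the stack is popped even if a handler throws.

```javascript
// Stand-in node: deliver() models a real user event, fireEvent() models an
// application-generated synthetic event.
class FakeIeNode {
  constructor() { this.handlers = []; }
  attachEvent(name, fn) { this.handlers.push({ name, fn }); }
  deliver(name) {
    for (const h of this.handlers.slice()) {
      if (h.name === name) h.fn({ type: name });
    }
  }
  fireEvent(name) { this.deliver(name); return true; }
}

function interposeOnFireEvent(NodeClass, stack) {
  const nativeFire = NodeClass.prototype.fireEvent;
  NodeClass.prototype.fireEvent = function (name, eventObj) {
    stack.push(name); // mark that a synthetic dispatch is in progress
    try {
      return nativeFire.call(this, name, eventObj);
    } finally {
      stack.pop();
    }
  };
}

function makeDomLogger(log, stack) {
  return function (event) {
    if (stack.length > 0) return; // synthetic event: deterministic, skip it
    log.push(event.type);
  };
}
```

A stack rather than a boolean flag handles the case where a handler for one synthetic event fires another synthetic event.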


3.1.8 Annotation Events 


At replay time, Mugshot dispatches synthetic DOM 
events to simulate user GUI activity. These events are in- 
distinguishable from the real ones with respect to the dis- 
patch cycle—given a particular application state, a syn- 
thetic event will cause the exact same handlers to execute 
in exactly the same order as a semantically equivalent 
real event. However, we noticed that synthetic events 
did not always update the visible browser state in the ex- 
pected way. In particular, we found the following prob- 
lems on both Firefox and IE: 

• According to the DOM specification, when a
  keypress event has finished the dispatch cycle,
  the target text input or content-editable DOM node
  should be updated with the appropriate key stroke.
  Our replay experiments showed that this did not
  happen reliably. For example, synthetic key events
  could be dispatched to a text entry box, but the value
  of the box would not change, despite the fact that the
  browser invoked all of the appropriate event handlers.

• <select> tags implement drop-down selection
  lists. Each selectable item is represented by an
  <option> tag. Dispatching synthetic mouse
  clicks to <option> nodes should cause changes in
  the selected item property of the parent <select>
  tag. Neither Firefox nor IE provided this behavior.

• Users can select text or images on a web
  page by dragging the mouse cursor or holding
  down the shift key while tapping a directional
  key. The browser visibly represents the selection
  by highlighting the appropriate text and/or
  images. The browser internally represents the
  selected items as a range of underlying DOM
  nodes. Applications access this range by calling
  window.getSelection() on Firefox and inspecting
  the document.selection object on
  IE. We found that dispatching synthetic key and
  mouse events did not reliably update the browser’s
  internal selection range, and it did not reliably recreate
  the appropriate visual highlighting.

To properly replay these DOM events, Mugshot defines
special annotation events. Annotation events are
“helpers” for events which, if replayed by themselves,
would not produce a faithful recreation of the logging-time
application state. Mugshot inserts an annotation
event into the log immediately after a DOM event which
has low fidelity replay. At replay time, Mugshot dispatches
the low fidelity synthetic event, causing the appropriate
event handlers to run. Mugshot then executes
the associated annotation event, finishing the activity induced
by the prior DOM event. Annotation events are
not real events, so they do not trigger application-defined
event handlers. They merely describe work that Mugshot
must perform at replay time to provide faithful emulation
of logging-time behavior.


To fix low-fidelity keypress events on text inputs, 
Mugshot’s keypress logger schedules a timeout inter- 
rupt with an expiration time of 0. The browser executes 
the callback immediately after the end of the dispatch 
cycle for the keypress, allowing Mugshot to log the 
value of the text input. At replay time, after dispatching 
the synthetic keypress, Mugshot reads the value an- 
notation from the log and programmatically assigns the 
value to the target DOM node’s value property. 


To ensure that clicks on <option> elements actually 
update the chosen item for the parent <select> tag, 
Mugshot’s mouseup logger checks whether the event 
target is an <option> tag. If so, this indicates that the 
user has selected a new choice. Mugshot generates an 
annotation indicating which of the <select> tag’s children
was clicked upon. At replay time, Mugshot uses the
annotation to directly set the selectedIndex prop- 
erty of the <select> tag. 
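
The replay side of the value and selection annotations described above can be sketched as a loop over the log. The function name replayWithAnnotations, the log entry fields, and the nodes and dispatch parameters are all hypothetical stand-ins for Mugshot's internal structures.

```javascript
// Replay a log in which low-fidelity DOM events are followed by annotations.
// `nodes` maps target identifiers to node objects, and `dispatch` delivers a
// synthetic event to a node.
function replayWithAnnotations(log, nodes, dispatch) {
  for (const entry of log) {
    const node = nodes[entry.target];
    if (entry.type === 'annotation') {
      // Annotations never reach application handlers; they only patch state.
      if (entry.kind === 'value') node.value = entry.value;
      if (entry.kind === 'selectedIndex') node.selectedIndex = entry.index;
    } else {
      dispatch(node, entry); // runs the application's event handlers
    }
  }
}
```

Because each annotation directly follows the event it completes, the application state after the annotation matches what the real event produced at logging time.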


Mugshot generates annotation events for selection 
ranges after logging keyup and mouseup events. On 
Firefox, the selection object conveniently defines a start- 
ing DOM node, a starting position within that node, an 
ending DOM node, and an ending position within that 
node. Mugshot simply adds the relevant DOM node 
identifiers and integer offsets to the annotation record. 
Abstractly speaking, Mugshot includes the same infor- 
mation for an annotation record on IE. However, IE 
does not provide a straightforward way to determine 
the exact extent of a selection range. So, Mugshot 
must cobble together several IE range primitives to de- 
duce the current range. Mugshot first determines the 
highest enclosing parent tag for the currently selected 
range. Then, Mugshot creates a range which covers all 
of the parent tag’s children, and progressively shrinks 
the number of HTML characters it contains, using IE’s 
Range.inRange() to determine whether the actual 
selection region resides within the shrinking range. At 
some point, Mugshot will determine the exact amounts 
by which it must shrink the left and right margins of the 
parent range to precisely cover the actual selected region. 
Mugshot logs the DOM identifier for the parent node and 
the left and right pinch margins. 


IE’s selection semantics are extremely complex, and 
we have not produced a complete formal specification for 
them. Since Mugshot cannot currently reproduce these 
semantics in all applications, Figure 1 lists Mugshot’s 
support for IE selection events as partial. 


3.1.9 Performance Optimizations 


Both Firefox and IE support the W3C mousemove 
event, which is fired whenever the user moves the mouse. 
Mugshot can log this event like any other mouse action, 
but this can lead to unnecessary log growth in Firefox if 
the application does not care about this event (remem- 
ber that Mugshot on Firefox logs all DOM events, re- 
gardless of application interest in them). Mugshot logs 
mousemove by default, but since few applications use 
mousemove handlers, the developer can disable its log- 
ging to reduce log size. In Section 4.2, we evaluate 
Mugshot’s log size for a drawing application that does 
use mousemove events. 

Games which have high rates of keyboard or mouse 
activity may generate many content selection annotation 
events. Generating these annotations is expensive on IE 
since Mugshot has to experimentally determine the se- 
lection range (see Section 3.1.8). Furthermore, games do 
not typically care about the selection zones that their GUI 
inputs may or may not have created. Thus, for games 
with high event rates like Spacius [16], we disable anno- 
tations for content selection. 


3.2 Replay 


Compared to the logging process, replay is straightforward.
Most of the complexity arises from replaying
load events, since JavaScript code cannot modify the 
network stack and stall data transmissions to recreate 
logging-time load orderings. Thus, Mugshot coordinates 
load events with a transparent caching proxy that the de- 
veloper inserts between his web server and the outside 
world. 

Mugshot must also shield the replaying
execution from new events that arise on the developer 
machine, e.g., because the developer accidentally clicks 
on a GUI element in the replaying application. Without a 
barrier for such new events, the replaying program may 
diverge from the execution path seen at logging time. 


3.2.1 Caching Web Content at Logging Time 


When a user fetches a page logged by Mugshot, the 
fetch is mediated by a transparent Mugshot proxy. The 
proxy assigns a session ID to each page fetch; this ID 
is stored in a cookie and later written to the Mugshot 
log. As the proxy returns content to the user, it updates a
per-session cache which maps content URLs to the data
that was served for those URLs during that particular ses- 
sion. Optionally, the proxy can rewrite static <html> 
and <frame> declarations to include the Mugshot li- 
brary’s <script> tag. 
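
The per-session cache can be pictured as a two-level map. The sketch below is an assumption-laden illustration: the class name SessionCache, the session ID format, and the method names are hypothetical, and a real proxy would also handle persistence and HTTP details that are omitted here.

```javascript
// Two-level map: session ID -> (URL -> content served during that session).
class SessionCache {
  constructor() {
    this.sessions = new Map();
    this.nextId = 1;
  }
  newSession() {
    const id = 's' + this.nextId++;
    this.sessions.set(id, new Map());
    return id; // the proxy would store this ID in a cookie
  }
  record(sessionId, url, body) {
    this.sessions.get(sessionId).set(url, body);
  }
  lookup(sessionId, url) {
    const session = this.sessions.get(sessionId);
    return session ? session.get(url) : undefined;
  }
}
```

Keying by session matters because the same URL may return different content in different sessions, and replay needs the bytes that this particular user saw.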


3.2.2 Replaying Load Events 


At replay time, the developer switches the proxy into 
replay mode, sets the session ID in his local Mugshot 
cookie to the appropriate value, and directs his web 
browser to the URL of the page to replay. The proxy 
extracts the session ID from the cookie, determining the 
cache it should use to serve data. The proxy then begins
to serve the application page, replacing any static
<script> references to the logging Mugshot library with
references to the Mugshot replay library.

During the HTML parsing process, browsers load and
execute <script> tags synchronously. The Mugshot
replay library is the first JavaScript code that the browser 
runs, so Mugshot can coordinate load interleavings with 
the proxy before any load requests have actually been 
generated. During its initialization sequence, Mugshot 
fetches the replay log from the developer’s log server and 
then sends an AJAX request to the proxy indicating that 
the proxy should only complete the loads for subsequent 
non-<script> objects in response to explicit “release 
load” messages from Mugshot. 

The rest of the page loads, with any <script> tags
loading synchronously. The browser may also launch
asynchronous requests for images, frame source, etc. 
These asynchronous requests queue at the proxy. Later, 
as the developer rolls forward the replay, Mugshot 
encounters load events for which the corresponding 
browser requests are queued at the server. Before signaling
the proxy to transmit the relevant bytes, Mugshot installs
a custom DOM 2 load handler for the target DOM
node so that it can determine when the load has finished 
(and thus when it is safe to replay the next event). 
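
The proxy-side half of this handshake amounts to a gate that holds each response until it is released. The sketch below is illustrative only: `LoadGate`, `enqueue`, and `release` are hypothetical names, and real network I/O is replaced by an in-memory list so that the ordering logic can be seen in isolation.

```javascript
// Hypothetical sketch: the proxy holds each asynchronous response until
// the replay library sends a "release load" message for that URL.
class LoadGate {
  constructor() {
    this.pending = new Map(); // URL -> function that sends the response
    this.delivered = [];      // order in which loads were actually completed
  }

  // Called when the browser's request reaches the proxy.
  enqueue(url, body) {
    this.pending.set(url, () => this.delivered.push({ url, body }));
  }

  // Called when the replay library reaches the matching load event in the log.
  release(url) {
    const send = this.pending.get(url);
    if (!send) return false; // the request has not arrived yet
    this.pending.delete(url);
    send();
    return true;
  }
}

// The browser requests a.png before b.png, but the log says b.png loaded first.
const gate = new LoadGate();
gate.enqueue("a.png", "bytes of a");
gate.enqueue("b.png", "bytes of b");
for (const url of ["b.png", "a.png"]) gate.release(url);
const replayOrder = gate.delivered.map((d) => d.url);
```

In this sketch the delivery order follows the log rather than the request order, which is the property the replay library relies on.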


3.2.3 The Replay Interface


At replay initialization time, Mugshot places a semi- 
transparent <iframe> overlaying the application page. 
This frame acts as a barrier for keyboard and mouse 
events, preventing the developer from issuing events to 
the replaying application that did not emerge from the 
log. We embed a VCR-like control interface in the bar- 
rier frame which allows the developer to start or stop 
replay (see Figure 3). The developer can single-step 
through events or have Mugshot dispatch them at fixed 
intervals. Mugshot can also try to dispatch the events in 
real time, although “real-time” playback of applications 
with high event rates may have a slowdown factor of 2 to 
4 times (see Section 4.2.2). 
Figure 3: Replaying Tetris (VCR control on left)


Whenever Mugshot replays an event, it can optionally 
place a small, semi-translucent square above the target 
DOM node. These squares are color-coded by event type 
and fade over time. They allow the developer to visu- 
ally track the event dispatch process, and are particularly 
useful for understanding mouse movements. 


3.2.4 Replaying Non-load Events 


Replaying events is much simpler than logging them. 
To replay a non-load DOM event, Mugshot locates the 
target DOM node and dispatches the appropriate synthetic
event. For low-fidelity events (Section 3.1.8), Mugshot
also performs the appropriate fix-ups using annotation 
records. To replay text selection events, Mugshot recre- 
ates the appropriate range object and then uses a browser- 
specific call to activate the selection. 

To replay timeout and interval callbacks, Mugshot’s
initialization code interposes on setTimeout() and
setInterval(). The interposed versions tag each
application-provided callback with an interrupt ID and 
add the callback to a function cache. This cache is 
built in the same order that IDs were assigned at log- 
ging time, so replay-time interrupt IDs are guaranteed to 
be faithful. Mugshot does not register the application- 
provided callback with the native interrupt scheduler. In- 
stead, when the log indicates that an interrupt should 
fire, Mugshot simply retrieves the appropriate function 
from its cache and executes it. Mugshot interposes
on the cancellation functions clearTimeout() and
clearInterval(), but the interposed versions are no-ops:
once the application cancels an interrupt at logging
time, Mugshot will never encounter it again in the log.
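
The timer interposition described above can be sketched as follows. This is an illustration under stated assumptions, not Mugshot's code: `installReplayTimers` and `fire` are hypothetical names, and `scope` stands in for the browser's global object so that the sketch can run outside a browser.

```javascript
// Hypothetical sketch of replay-time timer interposition. `scope` stands in
// for the browser's global object; the function names are illustrative.
function installReplayTimers(scope) {
  const callbacks = new Map(); // interrupt ID -> application callback
  let nextId = 1;

  const register = (fn) => {
    const id = nextId++;   // IDs are assigned in the same order as at logging time
    callbacks.set(id, fn); // cached, never handed to the native scheduler
    return id;
  };

  scope.setTimeout = (fn, delayMs) => register(fn);
  scope.setInterval = (fn, periodMs) => register(fn);
  scope.clearTimeout = () => {};  // no-op: a cancelled timer never
  scope.clearInterval = () => {}; // appears in the log again

  // Invoked by the replay driver when the log says interrupt `id` fired.
  return function fire(id) {
    callbacks.get(id)();
  };
}

const scope = {};
const fire = installReplayTimers(scope);
const trace = [];
const tickId = scope.setInterval(() => trace.push("tick"), 50);
const doneId = scope.setTimeout(() => trace.push("done"), 10);
// The log, not the delay values, dictates the dispatch order.
for (const id of [tickId, doneId, tickId]) fire(id);
```

Because nothing is registered with the native scheduler, the requested delays have no effect during replay; callbacks run only when the replay driver calls `fire`.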

Mugshot also interposes on the XMLHttpRequest 
constructor. Much like interrupt replay, Mugshot stores 
AJAX callbacks in a function cache and executes them 
at the appropriate time. Mugshot updates each synthetic 
AJAX object with the appropriate log data before invok-
ing the application AJAX handler. 

By interposing on the Date() constructor, Mugshot
forces time queries to read values from the log. The log
also contains the initialization seed used by the random
number generator at capture time. Mugshot uses this
value to seed the replay-time generator. This is sufficient
to ensure that subsequent calls to Math.random() replay
faithfully.
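
A minimal sketch of both mechanisms follows. Standard JavaScript exposes no way to seed the built-in Math.random(), so the sketch assumes that it is replaced by the same seedable generator in both modes. The paper does not say which generator is used; the linear congruential generator here is a stand-in, and `makeReplayDate` and `makeSeededRandom` are hypothetical names.

```javascript
// Hypothetical sketch: replay-time stand-ins for Date and Math.random.
// `loggedTimes` and `seed` represent values read from the log.
function makeReplayDate(loggedTimes) {
  const NativeDate = Date;
  let next = 0;
  return function ReplayDate(...args) {
    // Explicit arguments are deterministic, so pass them straight through.
    if (args.length > 0) return new NativeDate(...args);
    // A bare `new Date()` consumes the next logged timestamp instead.
    return new NativeDate(loggedTimes[next++]);
  };
}

// Assumed generator: a 32-bit linear congruential generator. Any seedable
// generator would serve, provided logging and replay use the same one.
function makeSeededRandom(seed) {
  let state = seed >>> 0;
  return function random() {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

const ReplayDate = makeReplayDate([1000, 2500]);
const t1 = new ReplayDate().getTime();
const t2 = new ReplayDate().getTime();
const logTimeRandom = makeSeededRandom(42);
const replayTimeRandom = makeSeededRandom(42);
```

With the same seed, the two generators produce identical streams, so only the seed has to be logged rather than every random value.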


3.3 Limitations


Mugshot uses a caching proxy to reproduce the load 
events in the log. If an application fetches external con- 
tent that does not pass through the proxy, Mugshot can- 
not guarantee faithful replay of its data or its load time. 
Thus, these ill-defined loads can ruin the fidelity of the 
entire replay. 

As described in Section 3.1.8, Mugshot must use an- 
notation records to properly replay GUI events involving 
drop-down boxes. Although the replay is correct from 
the perspective of a <select> tag’s internal JavaScript 
state, both IE and Firefox refuse to visually drop down
a drop-down list in response to synthetic events. How- 
ever, after Mugshot applies the annotation event, the vi- 
sual display of the tag adjusts to indicate the appropriate 
selection. 

Web applications typically fail because of unexpected 
interactions between HTML, CSS, and event-driven 
JavaScript code. Mugshot logs all of these application 
inputs and the associated event streams. In many cases, 
this is sufficient to recreate a bug; the log-time browser 
need not be the exact same type and version as the replay- 
time browser. However, some bugs arise from an in- 
teraction between the application and a specific browser 
type and version. In these cases, it is crucial for the log- 
time and replay-time browsers to be the same. Mugshot’s 
client-side component records the identity of its log-time 
browser (e.g., Firefox 3.5) so that the developer can run 
the same browser at debug time. However, even this may 
be insufficient to recreate some bugs—users can install 
browser plug-ins or change local configuration state, and 
the existence of that particular local state may be the root 
cause of a bug. Since this type of client state cannot be 
introspected by JavaScript code, Mugshot cannot reliably 
reproduce these kinds of bugs. 

If a web page contains multiple frames, proper log- 
ging requires each frame to contain the Mugshot logging 
<script> tag. Similarly, at replay time, each frame 
must contain Mugshot’s replay script. The replay proxy 
can automatically instrument statically declared frames. 
However, if a page dynamically creates frames, e.g., us- 
ing JavaScript, the developer is responsible for inserting 
the appropriate Mugshot tags. 


4 Evaluation 


For Mugshot to be useful, it must be unobtrusive at 
logging time and faithful to the original program exe- 
cution at replay time. If event logging makes programs 
sluggish, users will reject Mugshot-enabled applications; 
if Mugshot cannot reproduce real bugs, it provides no 
utility to application developers. In this section, we 
run Mugshot on a variety of microbenchmarks and real 
JavaScript programs, demonstrating that user-perceived 
logging overhead is no worse than 6.8% for applications 
with high event rates. We provide two examples of bugs 
that Mugshot can log and then reproduce at replay time. 
We also demonstrate that replay speed is not unacceptably
slow, and that event logs grow no faster than 100
KB per minute in applications with high event rates.

All experiments ran on an HP xw4600 workstation
with a dual-core 3GHz CPU and 4 GB of RAM. We 
tested Mugshot performance inside two browsers, Fire- 
fox v3.5.3 and IE8 v8.0.6001. When stripped of extrane- 
ous white space and comments, Mugshot’s logging code 
was 46 KB and its replay code was 35 KB. Note that only 
the logging code must be shipped to end users, and only 
the replay code must be shipped to debugger machines. 


4.1 Microbenchmarks 


To explore the basic computational overheads of log- 
ging and replay, we inserted Mugshot into several mi- 
crobenchmark applications. For each microbenchmark, 
we compared its run time without Mugshot support to its 
run time during logging and replay. We used the follow- 
ing test suite: 

• DeltaBlue is a constraint solver from Google’s V8
JavaScript benchmark suite [15]. The benchmark
is computationally intensive, but it has no user interface,
and it does not internally generate loggable
events. Thus, DeltaBlue’s Mugshot-enabled running
time reflects any penalty that Mugshot imposes
on straightline computational workloads.

• The Date benchmark simply called new Date()
5000 times.

• In baseline and logging mode, the click benchmark
dispatched 5000 synthetic mouse events as
quickly as possible. As explained in Section 3.1.7,
Mugshot normally does not log synthetic GUI
events since they are deterministic. However, for
the logging part of the benchmark, we forced
Mugshot to log the synthetic mouse clicks. At replay
time, Mugshot simply tried to dispatch these
logged events as quickly as possible.

• The setTimeout benchmark issued 25 calls
to setTimeout(function(){}, 0). If the
computational overheads of logging and replaying
a null function are high, fewer interrupts can be issued
per unit time.

Figure 4: Microbenchmark slowdown due to logging and replay.


Figure 4 shows the Mugshot-induced slowdowns for the 
test suite. Each graph depicts a benchmark’s execu- 
tion time in baseline, logging, and replay scenarios; per- 
formance is normalized with respect to baseline perfor- 
mance. Each result represents the average of 10 trials, 
and in all cases, standard deviations were less than 5%. 

As expected, Figure 4(a) shows that Mugshot intro- 
duced no overhead for a purely computational workload. 
Figure 4(b) demonstrates that logging activity did not de- 
lay interrupt scheduling in Firefox and IE. However, re- 
play did introduce a 50% slowdown on IE; we are still 
investigating the reasons for this behavior. 

As shown in Figure 4(c), logging penalties slowed the 
click benchmark by a factor of 3 on Firefox and 4.6 on 
IE. The overhead primarily arose from the complex logic 
needed to properly log mouse events on <select> ele- 
ments (Section 3.1.8). During replay, the click bench- 
mark slowed by a factor of 1.6 on Firefox and 2.1 on IE. 
Most of the slowdown was caused by regular expression 
computations during the parsing of each click log
entry. An optimized version of Mugshot would parse the 
entire log at replay initialization time. However, as we 
show in Section 4.2, our unoptimized Mugshot can al- 
ready replay real applications at a tolerable rate. 

Figure 4(d) shows the Mugshot penalties for the Date 
microbenchmark. The slowdown factors are large, rang- 
ing from 8.4 to 23.9. The reason is that fetching the 
current date in the baseline case is extremely fast: it
merely requires a read of a native browser variable. User-level
JavaScript code is much slower than native code,
so Mugshot’s Date() logging introduces high relative
overheads. Fortunately, Section 4.2 shows that real ap- 
plications do not issue time queries at a high enough rate 
to expose Mugshot’s logging overhead. 


4.2 Application Examples 


To evaluate Mugshot’s performance in more realistic 
conditions, we examined its logging and replay over- 
heads for seven applications. Three of the applications 
were games with varying rates of event generation. 

• DOMTRIS [26] is a JavaScript implementation of
the classic Tetris game.

• Pacman [6] is unsurprisingly a Pacman clone.

• Spacius [16] is a 2D side-scrolling space shooter.

These games implement many of their animations using
interrupt callbacks, so frame rates (and the user experi- 
ence) will suffer if Mugshot introduces too much latency 
to the critical path of interrupt dispatch. 

We also evaluated Mugshot’s performance on four 
non-games: 

• The Applesoft BASIC interpreter [2] parses and
runs BASIC programs, providing an emulated joystick
and graphical display.

• NicEdit [19] is a WYSIWYG text and HTML editor.

• Painter [24] is a simple drawing program.

• The JavaScript shell [25] provides a command-line
interface for manipulating the DOM and
application-defined JavaScript state.

These programs stress Mugshot’s handling of form, key,
and selection events. Painter also makes use of the
mousemove event, so we configured Mugshot to log
those events for this application.

Figure 5: Growth rate of logs (kilobits per second).

We evaluated Mugshot’s performance in the Firefox 
browser for each of the seven applications. However, 
we only evaluated Mugshot’s IE performance for the 
first four applications. The latter three applications do 
not replay correctly in IE; they trigger quirks in IE’s 
form/selection event model that Mugshot does not cur- 
rently handle. 


4.2.1 Log Sizes 


Figure 5 depicts the growth rate of Mugshot’s log for 
each application, showing the size of the verbose log 
and the compact log. The verbose log has a human- 
friendly format; among other things, it contains a dump 
of the page’s HTML at load time, and it explicitly tags 
each event with an easy-to-understand string representing
the event type and its parameters. In our experience,
just reading the verbose log can provide a human de- 
bugger with invaluable insights about program operation. 
The compact log discards the beautifications of the ver- 
bose log and represents events and their parameters using 
short status codes. The compact log is also compressed 
using the LZW algorithm with a window size of 200. 

Figure 5 shows that the rates of uncompressed log
growth varied widely, from roughly 10 Kbps (Tetris) to
106 Kbps for Painter on Firefox and 95 Kbps for the BASIC
interpreter on IE. Figure 5 also shows that Mugshot’s
log compression is effective, with a worst-case compressed
growth rate of 15.7 Kbps for the Painter application.
Painter’s logs were comparatively large because
Mugshot had to log frequent mousemove events.
Spacius generated the second highest growth rates (10.9
Kbps on Firefox and 10.4 Kbps on IE). As we discuss
in the next section, this was due to Spacius’ high
rate of interrupt events.

Figure 6: Interrupt rates.


4.2.2 Logging and Replay Overheads 


For Mugshot to be practical, its logging overhead must 
have a minimal impact on the user experience. It is less 
important for Mugshot to be able to replay application 
traces in real time. However, replaying should not be so 
slow that debugging is painful for a developer. 

Games update the screen in response to GUI events 
like mouse clicks. However, for graphically intense 
games, most of the screen updates are driven by timer 
interrupt callbacks. The dispatch rate of these call- 
backs provides a natural metric for Mugshot-induced 
slowdowns—the more overhead that Mugshot creates, 
the slower these callbacks execute, and the more sluggish
the application appears.

Figure 6 shows the interrupt dispatch rate for the three
games on IE and Firefox. As with the microbenchmark 
results, we show dispatch rates for baseline, logging, 
and replay scenarios. To measure these rates, we manu- 
ally identified all of the interrupt handlers in each game, 
adding a single line of code to each handler which incre- 
mented a global counter. At the end of 30 seconds, we 
divided this counter by the elapsed wall time to get the 
number of interrupts dispatched per second. 
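
The measurement reduces to a counter and a division. The sketch below is a hypothetical illustration: the paper describes adding a counter increment inside each handler by hand, whereas this sketch wraps the handler, which has the same effect. The names `counted` and `dispatchRate` are assumptions.

```javascript
// Hypothetical measurement sketch. The paper added one counter increment
// inside each handler by hand; wrapping the handler is an equivalent shortcut.
let dispatchCount = 0;

function counted(handler) {
  return function (...args) {
    dispatchCount += 1; // the "single line of code" per handler
    return handler.apply(this, args);
  };
}

// Dispatches per second over a measurement window.
function dispatchRate(count, elapsedMs) {
  return count / (elapsedMs / 1000);
}

// Example: a handler that fired 3000 times in a 30-second window.
const tick = counted(() => {});
for (let i = 0; i < 3000; i++) tick();
const rate = dispatchRate(dispatchCount, 30000); // 100 dispatches per second
```

Comparing this rate across baseline, logging, and replay runs gives the relative slowdowns reported for each game.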

Figures 6(a) and 6(b) show that for applications with 
low to moderate interrupt rates, the interrupt dispatch 
rate was unchanged, i.e., Mugshot logging introduced
no overhead. Compared to Tetris and Pacman, Spacius 
had a very high interrupt rate, executing about 100 call- 
backs per second. Figure 6(c) shows that for this ap- 
plication, Mugshot’s logging overhead reduced dispatch 
rates by 0.8% on Firefox and 6.8% on IE. However, 
Spacius gameplay did not seem qualitatively degraded 
during logging on either browser. 

Figure 6 shows that dispatch rates at replay time can 
decrease by as much as 75% in the case of a Spacius re- 
play on Firefox. This time dilation is certainly tolerable, 
but as mentioned in Section 4.1, we could improve the 
replay rate by optimizing our log parsing. 


4.2.3 Capturing Real Bugs 


Mugshot’s goal is to capture application runs and re- 
play them on developer machines. An important applica- 
tion of replay mode is the recreation of buggy application 
states. Armed with the event interleavings that generate 
a program fault, a developer can use powerful localhost 
debuggers to step through the log and inspect the appli- 
cation’s state after each event. 

Since we lacked detailed changelogs for the seven ap- 
plications described in Section 4.2, we could not inten- 
tionally undo a bug fix and then see whether Mugshot 
could successfully log and replay a problematic event se- 
quence. However, while performing the experiments in 
Section 4.2, we did encounter bugs in two of the applica- 
tions, both of which Mugshot could log and replay. 

The first bug involved a display glitch in the Tetris pro- 
gram which we were able to characterize in detail using 
Mugshot. A Tetris game terminates if a falling block nes- 
tles amongst the static blocks in a way that causes the 
overall block structure to exceed a maximum allowable 
height. When this happens, the game should render the 
bottom part of the most recent piece but leave the top 
part clipped, since this part extends above the playable 
area. However, depending on the shape of the most re- 
cent piece and the preexisting block structure, the Tetris 
implementation we tested would incorrectly render the 
final block structure, scattering the constituent blocks of 
the final piece in arbitrary positions, sometimes overwriting
preexisting blocks. Figure 3 shows an example of this
bug. The final piece is 1 block by 4 blocks, but when its 
stacking causes the game to end, two of its blocks myste- 
riously materialize in the square formation at the bottom 
of the screen. 

The second bug involved the Painter application. To 
draw a rectangle in this program, the user clicks on the 
“Rectangle Tool”, then drags the pointer across the canvas
with the mouse button down. If the user selects the
“Rectangle Tool” and just single-clicks on the drawing
area, no rectangle should be drawn. However, after sin- 
gle clicking, an expanding rectangle will appear as the 
user moves the mouse. This makes the user think that 
he is, in fact, drawing a rectangle. However, when the 
mouse is clicked again, the rectangle suddenly disap- 
pears. 

We captured, replayed, and diagnosed both bugs us- 
ing Mugshot. For both applications, the compressed log 
which captured the bug was under 11 KB in size. Trans- 
mitting such an error report to developers would be ex- 
tremely fast, even on a slow connection. 


5 Privacy 


Mugshot provides developers with an extremely de- 
tailed log of user behavior. Some might worry that this 
leads to an unacceptable violation of user privacy. Such 
privacy concerns are valid. However, web sites can (and 
should) provide an “opt-in” policy for Mugshot logging, 
similar to how Windows users must willingly decide to 
send performance data to Microsoft [14]. 

We also emphasize that Mugshot is not a fundamen- 
tally new threat to online privacy. Web developers al- 
ready have the ability to snoop on users to the extent al- 
lowed by JavaScript, and to send the resulting data back 
to their own web servers. Indeed, many web sites already 
perform a crude version of event logging using web anal- 
ysis services like CrazyEgg [9] that build heat maps of 
click activity on a particular page. In all cases, the scope 
of JavaScript-based snooping is limited by the browser’s 
cross-site scripting policies. From the browser’s perspec- 
tive, Mugshot is not an exception: it is subject to exactly 
the same restrictions designed to thwart malware. These 
restrictions prevent all programs—including Mugshot— 
from snooping on a frame owned by one domain and 
sending that data to a different domain. 


6 Conclusions 


As web applications have grown in popularity,
browsers have shipped with increasingly powerful 
JavaScript debuggers. These tools are extremely useful 
for introspecting applications that are running on a local 
development machine. However, they cannot be used to
examine program contexts which reside on remote ma- 
chines. When regular end users encounter application 
bugs, they will not inspect the application using their 
browser’s advanced debugger. At best, they will send 
a bug report which describes their problem using natural 
language. At worst, they will do nothing and simply be 
frustrated. Ideally, users would have a convenient way to 
give developers the precise event sequence that led to a 
buggy application state. The developer could then recre- 
ate the execution run and use his knowledge of the code 
to diagnose the problem. 

To address these issues, we created Mugshot, a 
lightweight framework for capturing JavaScript applica- 
tion runs and replaying them on different machines. Ex- 
periments show that Mugshot introduces little overhead 
at logging time. For applications like games which gen- 
erate many events, Mugshot slows execution speeds by 
6.8% in the worst case. Mugshot event logs grow at a 
reasonable rate, requiring 20-80 KB per minute of ap- 
plication activity. Using Mugshot’s replay mode, we 
have successfully recreated bugs in two real applications. 
Mugshot’s logs also support usability investigations and 
traditional click analytics.


Abstract


This paper proposes to exploit physical layer information 
towards improved rate selection in wireless networks. 
While existing schemes pick good transmission rates, 
this paper takes a step further towards computing the 
optimal bit rate. The main idea is to capture the chan- 
nel behavior through symbol level dispersions, and “re- 
play” these dispersions on different rate encodings of 
the same packet. The “replay” action can be emulated 
at the receiver without requiring the transmitter to send 
the packet at every other rate. The maximum success- 
ful rate is likely to be the optimal rate of the received 
packet, and assuming that the channel remains coherent, 
the same rate can be prescribed for the next transmis- 
sion. We design, implement, and evaluate this idea over 
a small testbed of USRP hardware and GNURadio soft- 
ware. Our proposal, called AccuRate, predicts a packet’s 
optimal rate 95% of the time when the packet is received
correctly. When the packet is received in error, AccuRate 
computes its optimal rate with 93% accuracy. In terms of 
throughput, we show that AccuRate improves over the 
state-of-the-art scheme SoftRate by around 10%, and is 
reasonably close to the optimal. 


1 Introduction 


Rate estimation is an important problem because it 
directly translates to throughput. The difficulty in rate 
estimation stems from channel fluctuations — the optimal 
rate quickly becomes stale, requiring a fresh round of 
estimation [1-3]. WiFi rate control is performed at the 
link layer, and hence, must operate on the granularity of 
packets. Approaches such as ARF [4], RRAA [5], and 
SampleRate [2] continuously track the success/failure 
of packets, and employ statistical prediction methods to 
select the appropriate rate. To improve responsiveness 
to channel fluctuations, alternate schemes have explored 
the use of SNR for rate selection. RBAR [6] and 
OAR [7] were the first-generation schemes that utilized 
RTS/CTS to exchange SNR values. However, with 
recent consensus to turn off RTS/CTS, new schemes are
recording historical SNRs and deriving a rate-versus- 
SNR relationship from it [8,9]. While this improves 
performance, continuously refreshing the SNR for every 
rate is often difficult [9]. Moreover, the rate-vs-SNR
relationship changes with different propagation envi- 
ronments, especially when the channel changes quickly 
over time [1]. Therefore, although practical SNR-based 
schemes are reasonably good at slower time-scales, they 
lack the agility to achieve per-packet rate adaptation in
dynamic wireless environments. 


This paper proposes to exploit physical layer informa- 
tion (such as symbol level dispersion on a constellation 
space) to improve the accuracy of rate selection. We 
show that such PHY layer information can be derived 
from a received packet, and then used to compute the 
optimal rate at which that packet should have been trans- 
mitted. Although the optimal rate is computed in retro- 
spect, it can be valuable for guiding the transmission rate 
of subsequent packets. Moreover, symbol level informa- 
tion can discriminate between losses due to fading and 
interference, further assisting in link layer retransmission 
strategies. Our ideas are consolidated into a constella- 
tion based rate estimation scheme, called AccuRate. We 
show that the improvements from AccuRate are consis- 
tent over diverse wireless environments. 


AccuRate’s main idea is intuitive. Given that the PHY 
layer encodes a sequence of bits into a symbol on the 
constellation space, AccuRate looks at the dispersion 
between the transmitted and received symbol positions. 
Small dispersions indicate that the communication link 
is strong, and perhaps capable of supporting higher rates 
than the one used. By comparing these dispersions to the 
permissible dispersions at different bit rates, AccuRate 
can precisely derive the maximum rate the packet could 
have been transmitted at. Even when the packet fails, 
AccuRate extracts known parts of the packet (preamble 
and postamble [10]), and estimates the appropriate rate 
from them. Of course, this is a retrospective analysis of 
a just-concluded transmission. However, as argued earlier,
knowing the optimal rate of a received packet is a
valuable primitive for rate control algorithms. The Accu- 
Rate receiver prescribes this rate to the transmitter, which 
in turn uses it for the next transmission. So long as the 
channel remains coherent between two consecutive pack- 
ets, AccuRate achieves a near-optimal rate selection ac- 
curacy. 
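
The comparison of observed dispersions against per-rate permissible dispersions can be illustrated with a deliberately simplified sketch. It is not AccuRate's algorithm: it ignores channel coding, assumes constellations normalized to unit average energy, and conservatively treats a modulation as viable only if every observed dispersion falls within half the minimum distance between its constellation points. The names `dispersion` and `bestModulation` are hypothetical.

```javascript
// Simplified, hypothetical sketch of dispersion-based rate selection.
// Constellations are assumed to have unit average energy; coding is ignored.
const MODULATIONS = [
  { name: "BPSK", bitsPerSymbol: 1, minDistance: 2 },
  { name: "QPSK", bitsPerSymbol: 2, minDistance: Math.SQRT2 },
  { name: "16-QAM", bitsPerSymbol: 4, minDistance: 2 / Math.sqrt(10) },
  { name: "64-QAM", bitsPerSymbol: 6, minDistance: 2 / Math.sqrt(42) },
];

// Dispersion: Euclidean distance between transmitted and received symbols
// on the constellation plane (in-phase `i`, quadrature `q`).
function dispersion(tx, rx) {
  return Math.hypot(rx.i - tx.i, rx.q - tx.q);
}

// A modulation is viable if every observed dispersion stays inside its
// decision radius (half the minimum distance between constellation points).
function bestModulation(dispersions) {
  let best = null;
  for (const m of MODULATIONS) { // ordered from lowest to highest rate
    if (dispersions.every((d) => d < m.minDistance / 2)) best = m;
  }
  return best;
}

// Two received symbols with dispersions of about 0.2 and 0.05.
const observed = [
  dispersion({ i: 0.7071, q: 0.7071 }, { i: 0.8271, q: 0.8671 }),
  dispersion({ i: -0.7071, q: 0.7071 }, { i: -0.7571, q: 0.7071 }),
];
const choice = bestModulation(observed);
```

In this example the largest dispersion (about 0.2) fits inside the 16-QAM decision radius (about 0.316) but not the 64-QAM radius (about 0.154), so the sketch selects 16-QAM in a single step.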


This paper is not the first to use PHY layer informa- 
tion towards rate estimation. Recently, authors in [1] 
proposed SoftRate, a scheme that uses PHY layer con- 
fidence values to estimate a packet’s bit error rate (BER). 
By comparing the BER against an empirically generated 
lookup table, the transmitter picks a “good” bit rate for 
subsequent transmissions. While SoftRate makes a valu- 
able contribution, we show that there is room for im- 
provement. Specifically, we show that by directly op- 
erating on symbol constellations, AccuRate can “jump” 
to the optimal rate in one step, while eliminating the 
reliance on empirical measurements. Moreover, Ac- 
cuRate’s approach scales to arbitrarily high bit rates, 
and does not require large gaps between the consecutive 
rates. Experiments performed in a wireless channel sim- 
ulator [11] (where the channel conditions can be repeated 
for fair comparison) demonstrate that AccuRate reliably
selects the optimal rate. Similar experiments on a pro- 
totype USRP testbed show consistent throughput gains 
under various wireless environments. Together, these 
results confirm that AccuRate advances the state of the 
art through PHY-aware rate estimation. AccuRate’s key 
contributions can be summarized as follows. 


• Identify the opportunity of rate estimation using
symbol dispersion at the PHY layer. We
verify our ideas through measurements on the
USRP/GNURadio platform. The findings offer new
insights for further research at the link layer.


• Develop a constellation based rate estimation
scheme (AccuRate) that “jumps” to the appropriate
rate. The wireless channel manifests itself
through symbol level dispersions. By “replaying”
the dispersions on packets at different rates, AccuRate
is able to identify the best bit rate of a packet.
This bit rate is prescribed for future transmissions.


• Implement and evaluate AccuRate on a USRP
testbed, and on an emulation platform composed
of USRPs and a wireless channel simulator. Results
from 25 hours of testbed experimentation
show consistent improvement in performance over
existing schemes. Emulation results (enabling experiments
under controllable and repeatable channel
conditions) exhibit similar trends.


2 Related Work


Perfect bit rate selection in wireless networks is an elusive
problem that has been researched extensively in the
past [1, 2, 4-9, 12-18]. Existing schemes have been
broadly classified as frame-level or SNR-based, and have
been well surveyed in [1]. Here, we touch upon only the
recent works relevant to AccuRate.


History based: SampleRate [2] by Bicket adapts trans- 
mission rate by periodically probing the channel with 
packets at various bit rates. The idea is to adapt to chang- 
ing channel conditions and minimize the overall trans- 
mission time for the packets. In RRAA [5] the authors 
propose faster rate estimation than SampleRate by using 
loss information from short frame windows. Schemes based
on frame error history, like SampleRate and RRAA,
do not distinguish between fading and collision, a distinction
that is significant for rate estimation. This class of schemes
is also slow to converge, and may not converge at all
if channel conditions change frequently. AccuRate distinguishes
between fading and collisions and has a one-packet
convergence time to estimate the best rate supported
by the channel.


SNR-based: Two recent SNR-based schemes take a 
cross layer approach to perform rate estimation. In 
[9], Camp and Knightly show that SNR-BER relation- 
ships change with the operating environment and there- 
fore need training to operate in a particular environment. 
They also compare existing SNR-based schemes with 
SNR-trained schemes to show that trained SNR schemes 
perform considerably better. In [17], the authors demon- 
strate the utility of adaptive modulation per frequency 
band. The variation of channel characteristics across fre- 
quency sub-bands accentuates the effect in ultra wide 
band regimes which will benefit the most from such 
schemes. To perform well these schemes need in-situ 
training for each environment. AccuRate does not need 
any training or information about the environment. 


Collision vs. Fading: Collision detection has been an 
area of active research and lately several schemes have 
been proposed [19-23]. The scheme in COLLIE [20] al- 
lows a transmitter to distinguish between a fading and a 
collision loss by having the receiver send back the er- 
roneously received packet. This allows the sender to 
identify the corrupt bits (via comparison with the orig- 
inal packet), and then analyze the cause of failure by an- 
alyzing the corruption patterns. Of course, the scheme 
depends on proper packet reception from the receiver in 
a timely manner. In [22], the authors propose a way to 
distinguish between collisions and fading, and adapt rate 
based only on the errors due to fading. This scheme is 
still history based and suffers from the same patholo- 
gies associated with other similar schemes. The use 
of OFDM symbol dispersions was shown in [23] as a
technique to distinguish between collision and fading.
Our work goes beyond making this distinction by using 
known dispersions to select the correct rate. 


SoftRate: The closest proposal to AccuRate is SoftRate 
[1], which was the first to exploit PHY layer information 
for rate estimation. We therefore focus on explaining the 
differences between SoftRate and AccuRate. SoftRate 
achieves high quality rate estimation using a cross layer 
approach, but we believe there is room for improvement. 
Specifically, SoftRate estimates the rate supported by the 
channel based on the BER of the received packet. The 
BER is an average of SoftPHY confidence values, com- 
puted from the dispersion of the received symbols from 
their nearest constellation symbols. SoftRate employs a 
heuristic to predict the BER at other bit-rates using the 
BER estimate at a given bit rate¹. While this heuristic
can effectively indicate when the rate must decrease to 
the next-lower bit rate (or increase to the next-higher bit 
rate), the ability to jump directly to the best rate is lim- 
ited. In contrast, AccuRate’s ability to replay the chan- 
nel distortion on all possible rates facilitates selection of 
the best rate in one step. The replaying mechanism is 
expected to scale to bit rates at potentially finer granu- 
larity. However, unlike SoftRate, the hardware cost and 
implementation complexity may be excessive. To bal- 
ance performance and complexity, one may envision a 
combination of AccuRate and SoftRate — a topic of fu- 
ture research. 


3 Background and Observations 


We present some background material on PHY layer 
encoding/decoding of bits with different modulation 
schemes. Building on this understanding, we observe 
that the extent of signal distortion due to channel fading 
is independent of the modulation scheme. We validate 
this through USRP/GNURadio measurements, and use it 
as a pivot for subsequently proposed ideas. 







Figure 1: Symbol constellation for 16QAM: (a) Each 
symbol corresponds to a 4-bit sequence. (b) Symbols 
received after suffering channel-induced dispersions. 


¹ The heuristic exploits the empirical observation that, under a given
SNR, adjacent bit rates experience a factor of 10 difference in BER. 


3.1 PHY Layer Symbol Constellations 


The PHY layer encodes a sequence of bits into a PHY
symbol which is represented by a position on a 2D
complex plane called the constellation diagram. Figure
1(a) shows an ideal constellation diagram for 16-ary
quadrature amplitude modulation (16QAM). If the
transmitter wishes to send a bit sequence “0000”, it sets the
In-Phase (x-axis) and Quadrature (y-axis) to a value of
<-3,-3>. The receiver recovers the I and Q values
after demodulation, and plots each symbol on the IQ
plane. Since the channel distorts the transmitted signals,
the received symbol positions get dispersed from their
ideal positions. Let $r_i$ be the received symbol position
and $s_i$ be its ideal symbol position. We define its
dispersion as $d_i = r_i - s_i$. This is essentially the Error
Vector Magnitude (EVM) [24, 25], but for ease of
understanding, we refer to it as dispersion. Figure 1(b) shows an
example of the dispersed symbols at the receiver.





To decode the symbols, for each received symbol
position $r_i$, the receiver guesses the corresponding ideal
symbol position $s_i$. A simple method is to pick the
symbol that is closest to the received symbol position $r_i$. In
other words, there is a tile associated with each symbol
in the constellation. When a symbol $s_i$ is received
correctly, its received position falls within the symbol’s tile,
i.e., $r_i \in \mathrm{tile}(s_i)$. When all the symbols in a packet
are received correctly, the corresponding bits will pass
the CRC check and the packet is handed to the upper
layer.


With channel fading or interference from nearby
transmissions, the received symbol position $r_i$ may be quite
far away from the transmitted symbol position $s_i$. The
received position $r_i$ may even fall outside the tile of $s_i$, i.e.,
$r_i \notin \mathrm{tile}(s_i)$. Then, $r_i$ will be closer to another symbol
position than $s_i$, misguiding the receiver to believe that
some other symbol was transmitted instead of $s_i$. This
error will be caught later when the CRC check on the
packet fails. Of course, channel coding techniques, such
as forward error correction (FEC), may be effective in
correcting some errors in demodulation. If the number
of errors is large, even channel coding may not be
adequate to recover the packet.
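
As an illustration of the decoding rule in Section 3.1, the following minimal Python sketch builds an ideal 16QAM grid, computes a dispersion vector, and applies the nearest-symbol decision. The grid levels {-3, -1, 1, 3} follow the <-3,-3> example above; the function names and the two sample received positions are our own assumptions for the example, not part of any implementation described in this paper.

```python
from itertools import product

# Ideal 16QAM grid: I and Q each take a value in {-3, -1, 1, 3}.
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(i, q) for i, q in product(LEVELS, LEVELS)]


def dispersion(received: complex, ideal: complex) -> complex:
    """Dispersion vector d_i = r_i - s_i."""
    return received - ideal


def nearest_symbol(received: complex, constellation: list[complex]) -> complex:
    """Hard decision: pick the ideal point closest to the received position."""
    return min(constellation, key=lambda s: abs(received - s))


def in_tile(received: complex, ideal: complex, constellation: list[complex]) -> bool:
    """True when r_i lies in tile(s_i), i.e. s_i is the nearest ideal point."""
    return nearest_symbol(received, constellation) == ideal


sent = complex(-3, -3)
mild = complex(-2.6, -3.3)    # hypothetical small dispersion, stays in the tile
severe = complex(-1.8, -3.1)  # hypothetical dispersion crossing the I = -2 boundary

for r in (mild, severe):
    print(abs(dispersion(r, sent)), in_tile(r, sent, QAM16))
```

The second position is decoded as the neighboring symbol at <-1,-3>, which is the kind of error that the CRC check (or channel coding) has to catch.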


3.2 Relation between Transmission Rate 
and Symbol Constellation Density 


For ease of explanation, let us ignore channel coding for
now and assume that each symbol encodes the actual
bits from the packet. Observe that a higher transmission
rate is realized by encoding a longer bit sequence
on each symbol. Thus, if one increases the length of the
sequence from 2 to 4 bits per symbol, the constellation
diagram must also accommodate a greater number of
symbols (from 4QAM to 16QAM). In other words, the
density of symbols in the constellation diagram increases at
higher rates, as shown in Fig. 2. Since increased
density implies shorter distance between neighboring
symbols, the received packet is more susceptible to errors at
higher rates when the channel is weak. Figure 3 confirms
this by showing that the maximum tolerable BPSK error
($|s_i - r_i|$) can be twice that of QPSK (or 4QAM), and
four times that of 16QAM. This well-known observation
will underlie the design of AccuRate.


Figure 2: Symbol density increases with increasing data rates (BPSK, QPSK (or 4QAM), 16QAM, 64QAM).
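
To make the density argument concrete, the sketch below computes half the minimum distance between ideal points for BPSK and square QAM constellations. It assumes every constellation is scaled to unit average power, which is our own normalization for illustration; the testbed behind Fig. 3 may use a different scaling, so the exact ratios here are not meant to reproduce the measured ones.

```python
import math
from itertools import product


def square_qam(bits_per_symbol: int) -> list[complex]:
    """BPSK (1 bit) or square QAM grid, scaled to unit average power."""
    if bits_per_symbol == 1:
        points = [complex(-1, 0), complex(1, 0)]
    else:
        side = 2 ** (bits_per_symbol // 2)
        levels = [2 * k - (side - 1) for k in range(side)]
        points = [complex(i, q) for i, q in product(levels, levels)]
    power = sum(abs(p) ** 2 for p in points) / len(points)
    return [p / math.sqrt(power) for p in points]


def tolerable_dispersion(points: list[complex]) -> float:
    """Half the minimum distance between two distinct ideal points."""
    dmin = min(abs(a - b) for a in points for b in points if a != b)
    return dmin / 2


for name, bits in [("BPSK", 1), ("QPSK", 2), ("16QAM", 4), ("64QAM", 6)]:
    print(name, round(tolerable_dispersion(square_qam(bits)), 3))
```

Under this normalization the tolerance shrinks monotonically as the constellation gets denser, which is the qualitative trend the text relies on.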


[Figure 3 plot: testbed measurements with one curve each for BPSK, QPSK, 16QAM, and 64QAM; x-axis: scalar distance between ideal and received symbol (0 to 1.8).]


Figure 3: Lower rates can tolerate higher magnitude of 
symbol dispersion 


3.3 Relation between Dispersion due to
Channel Fading and Bit Rate


We now demonstrate that symbol dispersion is not
influenced by the modulation scheme (or transmission rate),
and is only a function of the channel. We transmit data
from a static USRP sender to a static USRP receiver
using 2, 4, 16, and 64 QAM. We maintain as much
coherence in the channel as possible (by keeping the physical
environment static), and transmit small packets
repeatedly using different modulation schemes in a round robin
manner. For every received symbol, we calculate its
dispersion from the correct constellation symbol². Figure 4
plots the CDF of symbol dispersion magnitude for each
modulation scheme for packets transmitted in one round.
Almost-identical curves provide evidence that the
dispersions are independent of the symbol constellation, and
therefore of the transmission rate. More detailed
experimental evidence is presented in [25].


² The correct constellation symbol is known because the transmitted
packet is known in our experiments. Thus, even when a packet fails,
we can still compute the correct dispersions.


Figure 4: CDF of symbol dispersion magnitude for
packets transmitted with different modulation schemes. Not
all packets were received correctly, but their dispersions
could be computed offline using the (known) transmitted
packet.


These observations enable us to model the channel
behavior based on the dispersion of known symbols at the
receiver. The receiver can then conduct a what-if analysis
by “replaying” the channel on a packet encoded at
different rates. For instance, Fig. 5(a) shows the dispersion
of symbols when a packet was transmitted using 4QAM.
Given that the dispersion is independent of the
modulation, the receiver can check whether a higher
modulation such as 16QAM, with its denser constellation, could
have tolerated the same level of dispersion. In other
words, 16QAM is feasible if all the received symbols in
each 4QAM quadrant can be accommodated in a smaller
16QAM tile (drawn with dashed grids) as in Fig. 5(b).
The original 16QAM grid, as shown in Fig. 7, has been
shifted and superimposed on Fig. 5(b) for the purpose
of demonstration. In this example, 16QAM is not
feasible since some received symbols spill out of their correct
tile. More generally, this shows that the outcome of a
16QAM transmission may be predicted without actually
transmitting the packet over the air. Repeating this over
all possible bit rates will reveal the best possible rate for
this just-received packet. Such a retrospective analysis
can guide us in subsequent rate control decisions. The
details of how this hindsight is leveraged are described in
the following sections.


Figure 5: Receiver checks if QPSK (4QAM) packet
could have sustained higher bit rate (16QAM).


4 Determining The Optimal Transmission 
Rate in Retrospect 


We now explain how a receiver can determine, from a
received packet, what the optimal rate would have been
for transmitting that packet. A high level schematic of
the procedure is depicted in Fig. 6. We present the rate 
computation method for three cases: (1) when the packet 
is received successfully, (2) when the packet fails due to 
fading, and (3) when the packet fails due to interference. 
We support our basic claims with measurements from the 
USRP testbed. 


4.1 In Case of Successful Packet Reception 


Let us first consider the case where a packet is
successfully received. Since all the bits are decoded
correctly, the receiver is aware of all the transmitted
symbols. Hence, it can compute the dispersion $d_i$ between
each transmitted symbol position $s_i$ and received
symbol position $r_i$. Assuming $N$ symbols in the packet, the
channel can then be characterized by $D$, a sequence of
dispersions, i.e., $D = \{d_1, d_2, \ldots, d_N\}$. Now, suppose
the packet was transmitted at a bit rate of $R$. Given that
it was received successfully, it is clear that the
symbol-constellation density corresponding to $R$ can tolerate the
dispersion $D$. Now the question is: what is the highest
rate, $R^*$ ($\geq R$), at which the transmission would have
been successful over a channel with dispersion sequence
$D$?


Note that, as argued before, $D$ is independent of the
modulation used by the transmitter, i.e., the $i$-th
symbol gets dispersed by $d_i$ regardless of whether that
symbol is from the constellation of BPSK, QPSK, 16QAM,
or 64QAM. Consequently, the receiver can analyze the
outcome of different modulations without requiring the
transmitter to explicitly send the packet once per
modulation. The procedure to check whether a
transmission at a higher modulation would be successful is as
follows. For each symbol $i$, we apply the dispersion vector
$d_i$ on its ideal position $s_i$ in the constellation space and
check if the resulting symbol position would still be
correctly decoded. If that position happens to be closer to
some other constellation point (i.e., in some other tile),
this constellation is too dense for this dispersion. In
this manner, the densest constellation is chosen in
which, for each symbol $i$, the dispersed position
$s_i + d_i$ remains within the tile of $s_i$. Figure 7 illustrates
this checking operation: a symbol received through BPSK
modulation is being tested against a 4QAM and a 16QAM
constellation. In this example, the channel-induced
dispersion can be tolerated by a 4QAM symbol, whereas a
16QAM symbol will not be decoded correctly as its received
position falls in the wrong tile. Ignoring error coding, this
implies that 16QAM is an inappropriate rate for transmitting
this packet. However, 4QAM may prove to be suitable,
provided all the symbols in the packet pass this test
successfully.


³ The what-if analysis with the “replay” operation is applicable even
with channel coding, as discussed later in Section 4.1.1.


Figure 6: Flowchart of determining the optimal rate in
retrospect.
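
The checking procedure of Section 4.1 can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the AccuRate implementation: constellations are scaled to unit average power, channel coding is ignored, and each recorded dispersion is tested against every ideal point of a candidate constellation (a conservative stand-in for replaying on one random packet). The names `constellation`, `survives_replay`, and `best_modulation` are hypothetical.

```python
import math
from itertools import product


def constellation(bits: int) -> list[complex]:
    """BPSK (1 bit) or square QAM, scaled to unit average power (assumed)."""
    if bits == 1:
        return [complex(-1, 0), complex(1, 0)]
    side = 2 ** (bits // 2)
    levels = [2 * k - (side - 1) for k in range(side)]
    pts = [complex(i, q) for i, q in product(levels, levels)]
    scale = math.sqrt(sum(abs(p) ** 2 for p in pts) / len(pts))
    return [p / scale for p in pts]


def survives_replay(dispersions: list[complex], points: list[complex]) -> bool:
    """Apply every recorded dispersion d to every ideal point s and require
    that s + d is still decoded (nearest point) as s."""
    for d in dispersions:
        for s in points:
            r = s + d
            if min(points, key=lambda p: abs(r - p)) != s:
                return False
    return True


def best_modulation(dispersions, candidates=(1, 2, 4, 6)):
    """Densest constellation (bits per symbol) that tolerates the recorded
    dispersions; None if even the sparsest candidate fails."""
    best = None
    for bits in candidates:
        if survives_replay(dispersions, constellation(bits)):
            best = bits
    return best


# Hypothetical dispersions recorded from a correctly received packet.
recorded = [0.5 + 0j, -0.5 + 0j, 0.5j, -0.5j]
print(best_modulation(recorded))
```

With these sample dispersions the QPSK tiles still contain every dispersed position while the 16QAM tiles do not, mirroring the situation drawn in Fig. 7.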


Figure 7: Computing the appropriate rate at which this
packet reception would have been successful. In this case
BPSK and QPSK will be successful whereas 16QAM
will not.


4.1.1 Error Correction with Channel Coding 


We now introduce the role of channel coding in rate com- 
putation. Briefly, channel coding helps in error correc- 
tion by including redundant bits in the packet. Some 
symbols may fall in the incorrect tile on the constellation, 
but channel coding may still be able to correct them. This 
implies that coding can allow for a denser constellation, 
at the expense of a larger packet size. The net result is a 
new intermediate rate between the sparse and the dense 
constellation. To clarify with an example from 802.11g,
uncoded 4QAM results in 24 Mbps. However, a data rate of
18 Mbps can be achieved if the 4QAM modulation is combined
with a 3/4 coding scheme. Thus, channel coding allows
for additional data rates, offering finer-grained choices to 
the rate selection algorithm. 


With coding, our goal then is to precisely identify the
best <modulation, coding> tuple at which this packet
would have been successful. For this, the receiver
considers every higher modulation scheme, and computes
the fraction of symbols that would have been in error.
Note that the higher the modulation, the greater the
number of errors, and the larger the number of redundant bits.
Suppose a modulation $M_1$ needs 3/4 coding to correct
errors and a higher modulation $M_2$ requires the more
redundant 1/2 coding. Let $R_1$ and $R_2$ be the rates
corresponding to $M_1$ and $M_2$ respectively. Then, the effective rates
(after accounting for the overhead due to coding) would
be $\frac{3}{4} R_1$ and $\frac{1}{2} R_2$. The higher effective rate is then chosen
as the best rate in retrospect, i.e., $R^* = \max(\frac{3}{4} R_1, \frac{1}{2} R_2)$.
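
As a worked instance of this comparison, the sketch below picks the higher effective rate between two candidates. The pairing of QPSK with 3/4 coding and 16QAM with 1/2 coding is chosen purely for illustration, and the raw rates of 24 and 48 Mbps are the uncoded 802.11a/g rates of those modulations over a 20 MHz channel.

```python
from fractions import Fraction


def effective_rate(raw_mbps: int, code_rate: Fraction) -> Fraction:
    """Rate left for data after accounting for the coding overhead."""
    return raw_mbps * code_rate


# Illustrative candidates: M1 = QPSK needing 3/4 coding,
# M2 = 16QAM needing the more redundant 1/2 coding.
candidates = {
    "M1 (QPSK, 3/4)": effective_rate(24, Fraction(3, 4)),
    "M2 (16QAM, 1/2)": effective_rate(48, Fraction(1, 2)),
}

best = max(candidates, key=candidates.get)
print(best, float(candidates[best]), "Mbps")
```

Here the denser constellation wins despite its heavier coding, because 1/2 of 48 Mbps exceeds 3/4 of 24 Mbps.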


Besides offering bit rates at finer granularity, channel 
coding also allows for precisely computing the symbol 
dispersions. If a packet is known to pass the CRC check, 
the exact dispersion can be computed for all the symbols 
in that packet. These dispersions can then be recorded 
and replayed on higher-rate packet encodings to estimate 
the best bit rate. Observe that a packet may be success- 
ful even if some replayed dispersion causes the symbol to 
fall in an incorrect tile — channel coding may absorb these 
errors, similar to over-the-air reception. In other words,
so long as the dispersions are precisely known, the
replaying operation is no different from an actual
transmit-receive operation. As will be clear from Section 5.2, even
the same hardware chain may be reused for both the ac- 
tual reception and the replayed operation. This implies 
that, as long as a packet is received correctly, AccuRate 
can retrospectively compute its optimal bit rate, and use 
it for the subsequent transmission. 


4.2 In Case of Packet Loss due to Fading 


Let us now consider the case where a packet is not
received correctly, and the receiver has to find the smallest
rate reduction that would have resulted in a successful
reception. This is more challenging because, unlike the
above case, the receiver does not know the actual
transmitted symbol $s_i$ for every received symbol position $r_i$.
Since the packet failed the CRC check, some $r_i$ must
have been outside the tile of the correct symbol $s_i$ and
inside the tile of some other symbol. The receiver does
not know which of the symbols are incorrectly decoded
and so it cannot precisely compute for each symbol the
dispersion $d_i$ caused by the channel.


Fortunately, each packet starts with a preamble, a
globally known sequence of bits, that the receiver uses to
detect and synchronize onto a newly arriving signal.
The receiver can utilize the preamble to estimate
dispersion [26]. Suppose the preamble consists of $k$ symbols
and their computed dispersions are $d^{pre}_1, d^{pre}_2, \ldots, d^{pre}_k$.
We subject the whole packet to this sequence of $k$
dispersions, i.e., we compute the dispersion for the $i$-th symbol
in the packet as $d_i = d^{pre}_{i \bmod k}$. Given this set of $d_i$ vectors,
we try to estimate the optimal rate $R^*$ using the same
approach as described earlier.


The preamble’s symbol dispersions will only capture the
channel behavior in the earlier parts of the packet. If the
channel changes over time, the later changes will remain
unquantified. Therefore, to better cope with channel
variations, a postamble [10] may be inserted at the end of the
packet. Suppose the postamble also consists of $k$
symbols and their dispersions are $d^{post}_1, d^{post}_2, \ldots, d^{post}_k$.
Given the original packet size of $N$ symbols, and a
randomly generated packet of $N$ or more symbols, we
subject the $i$-th symbol in the random packet to a dispersion
$d_i$, computed as follows: if $i < N/2$ then $d_i = d^{pre}_{i \bmod k}$, else
$d_i = d^{post}_{i \bmod k}$. The rationale is that the preamble is a
better representative of the first half of the original packet
duration, while the postamble is better for the rest of the
duration. Also, when the random packet is encoded at a
lower rate, the number of symbols increases. The
postamble is likely to be a better estimate of the channel for
these symbols as well. Once again, based on the $d_i$ vectors
thus obtained, we try to estimate the optimal rate of the
packet, $R^*$. The overhead of the postamble would be
justifiable if it leads to throughput improvements due to
better rate estimation.
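
The assignment of preamble and postamble dispersions to the symbols of the replayed packet can be sketched as follows. The sketch assumes that the known dispersions are reused cyclically and that the switch from preamble to postamble samples happens at the midpoint of the original packet, as the rationale above suggests; the function name and the sample values are hypothetical.

```python
def assign_dispersions(pre: list[complex], post: list[complex],
                       n_original: int, n_replay: int) -> list[complex]:
    """Dispersion for each of the n_replay symbols of the random packet.

    Symbols in the first half of the original packet duration reuse the
    preamble dispersions (cycled); all later symbols, including the extra
    ones that appear at lower rates, reuse the postamble dispersions.
    """
    out = []
    for i in range(n_replay):
        if i < n_original / 2:
            out.append(pre[i % len(pre)])
        else:
            out.append(post[i % len(post)])
    return out


# Hypothetical measurements: the channel degraded towards the end of the packet.
pre = [0.05 + 0.02j, -0.03 + 0.04j]
post = [0.30 - 0.10j, -0.25 + 0.20j]
print(assign_dispersions(pre, post, n_original=4, n_replay=6))
```

The resulting list can then be fed to the same replay check used for correctly received packets.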


4.3 In Case of Packet Loss due to
Interference and Fading


Rate selection must be approached somewhat differently
when interference is the cause of packet failure. Under
interference only, the transmitter should ideally back off
and transmit at the same rate. Under both interference
and fading, the ideal approach is to back off but transmit
at a rate that accounts for the channel’s fading
component. We approach this problem by looking at both the
preamble and the postamble. The presence of a preamble
and postamble in a packet offers multiple “glimpses” into
how the channel varied during packet reception. Because
both preamble and postamble are known, the receiver
computes their respective dispersion vector sequences
$D^{pre}$ and $D^{post}$. Using statistical methods, we
compute the similarity of these sequences (detailed in Section
6). If there was no interference, $D^{pre}$ and $D^{post}$ would
be similar and the loss is attributed to fading.
Otherwise, depending on whether the interference overlapped
with the preamble or postamble, $D^{pre}$ or $D^{post}$ would
exhibit a higher dispersion than the other. If rate must
be selected only in response to channel fading, then we
must select $R^*$ based on $\min(D^{pre}, D^{post})$. Of course,
when the interference overlaps with both the preamble
and the postamble, $D^{pre}$ and $D^{post}$ will be similar, and
our approach will incorrectly select a lower-than-optimal
rate. Also, if the interfering packet is small enough to
fit between the preamble and postamble of a transmission,
AccuRate will fail to prescribe backoff although it will
still estimate the rate induced by fading alone. One way
to alleviate this problem is to insert known “midambles”
in different parts of the packet, thereby allowing for
multiple glimpses into the channel behavior. We discuss
these possibilities in Section 7.
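
The comparison of the two dispersion sequences can be illustrated with a deliberately simple test. The similarity measure used here (ratio of mean dispersion magnitudes against a threshold of 2.0) is a placeholder of our own and should not be read as the statistical method detailed in Section 6; the function names and sample values are likewise hypothetical.

```python
from statistics import mean


def mean_magnitude(dispersions: list[complex]) -> float:
    return mean(abs(d) for d in dispersions)


def classify_loss(d_pre: list[complex], d_post: list[complex],
                  ratio_threshold: float = 2.0):
    """Return (cause, dispersions to use for rate selection).

    Similar sequences  -> loss attributed to fading, use both sequences.
    Dissimilar         -> interference hit one end of the packet, use the
                          sequence with the lower dispersion.
    """
    m_pre, m_post = mean_magnitude(d_pre), mean_magnitude(d_post)
    low, high = sorted((m_pre, m_post))
    similar = high == 0 or (low > 0 and high / low < ratio_threshold)
    if similar:
        return "fading", list(d_pre) + list(d_post)
    cleaner = d_pre if m_pre <= m_post else d_post
    return "interference", list(cleaner)


pre = [0.1 + 0.1j, -0.1j]
post_clean = [0.12 + 0j, 0.1j]   # comparable to the preamble
post_hit = [0.9 + 0j, 0.8j]      # postamble overlapped by an interferer
print(classify_loss(pre, post_clean)[0], classify_loss(pre, post_hit)[0])
```

As noted above, any such test cannot separate the two causes when interference overlaps both ends of the packet, or neither.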


4.4 Experimental Validation 


The above approach, AccuRate, raises a few obvious
questions about its feasibility and performance. How
accurately can a receiver determine the optimal rate in
retrospect? Is the preamble sufficient, or is the postamble also
necessary for estimating the rate in case of packet loss
due to fading? To answer these questions, we conducted
experiments on a Rayleigh fading channel simulator [11]
and a real testbed. The evaluation setting is described in
detail in Section 6. Briefly, in the simulator we froze the
channel parameters for a Rayleigh fading model in
GNURadio and computed the ideal rate R (by transmitting at
all rates). Next, we allowed the receiver to determine the
optimal rate R* from the received packet under identical
channel conditions. We repeated this experiment with
different channel parameters and transmit powers.
Overall, we compared R and R* for more than 2000
packets. Fig. 8(a) shows the comparison by plotting the
difference between R* and R in rate levels (e.g., successive
802.11 bit rates such as 24 Mbps and 18 Mbps are
separated by 1 rate level). These results indicate that when
the packets are received correctly, R* = R for every
instance. Even with the preamble alone, AccuRate can
determine the rate correctly in 80% of the cases, and the
addition of the postamble improves the accuracy to 95%. The
postamble samples help AccuRate better estimate
wireless channel coefficients using symbol dispersion. We
observed similar results over a real wireless channel
between a USRP/GNURadio transmitter and receiver pair;
Fig. 8(b) shows these results.


The above description and supporting results offer
reason to believe that symbol dispersion information
gathered from a received packet can be used to estimate the
packet’s optimal rate. Considering that the channel
coherence time is expected to be on the order of multiple
packets, the receiver can prescribe the same rate for the
subsequent transmission. One could argue that the
optimal rate for the previous packet may not be optimal
for the next packet or may even cause packet loss. But
note that every rate adaptation scheme has to speculate
about future channel conditions based on past
measurements. Any scheme that is not overly conservative and
attempts to extract the best throughput from the channel
runs the risk of packet loss. However, since our approach
is based on fine-grained information about the channel, the
next packet has a reasonable chance of succeeding at the
prescribed rate. In the following section, we gather our
ideas into a single Constellation Based Rate estimation
protocol, called AccuRate.


5 AccuRate Protocol and Implementation 


5.1 Protocol 


The AccuRate module is located at the boundary of
the PHY and MAC layers. For every outgoing frame,
AccuRate concatenates it with a postamble. Upon
reception of this packet, the AccuRate receiver performs
the following checks and reacts accordingly. If the
packet is correctly received, AccuRate estimates the best
transmission rate, and piggybacks it in the
acknowledgment (ACK). If the packet is incorrectly received
(meaning that the preamble was decoded but the CRC check
failed), AccuRate triggers an interference-detection
operation. If the failure was not due to
interference, AccuRate estimates the appropriate rate using only
the pre/postambles [1], and conveys this back through
a negative acknowledgment (NACK). However, if
interference was the cause of failure, AccuRate performs rate
estimation using either the interference-free preamble or
postamble, depending on which exhibits lower symbol
dispersion. AccuRate conveys this fading-induced rate
in the NACK, but also instructs the transmitter to back
off according to regular 802.11. In the worst case, if
the packet’s preamble itself is non-decodable, AccuRate
cannot perform any rate prediction. The transmitter does
not receive any ACK/NACK, and retransmits the packet
as per the 802.11 specifications. In all other cases, the
transmitter adopts AccuRate’s rate and backoff
prescriptions, and prepares accordingly for the next transmission
to the same receiver.


Figure 8: Accuracy of determining the optimal rate in retrospect: (a) simulation; (b) testbed. The x-axis is expressed
in rate levels, where two successive 802.11 rates are assumed to have a rate level difference of one. The
AccuRate-correct-packets and AccuRate-pre+postamble curves overlap. These results correspond to channel conditions with
fading but without interference (impact of interference on AccuRate is evaluated in Section 6). The optimal rate in the
testbed scenario is determined by sending a train of small packets at all possible rates (details in Section 6.1).
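
The receiver-side decisions of Section 5.1 can be summarized in a short sketch. It only encodes the branching described above: the three rate estimates are assumed to be computed elsewhere and passed in, and the `Feedback` type, the function name, and the parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    kind: Optional[str]       # "ACK", "NACK", or None when no frame is sent
    rate_mbps: Optional[int]  # rate prescribed for the next transmission
    backoff: bool             # whether the sender is told to back off


def receiver_feedback(preamble_decoded: bool, crc_ok: bool, interference: bool,
                      rate_from_packet: int, rate_from_pre_post: int,
                      rate_from_cleaner_amble: int) -> Feedback:
    if not preamble_decoded:
        # No rate prediction is possible; the sender hears nothing and
        # retransmits as per the 802.11 specification.
        return Feedback(None, None, False)
    if crc_ok:
        # Correct reception: piggyback the best rate in the ACK.
        return Feedback("ACK", rate_from_packet, False)
    if not interference:
        # Loss due to fading: estimate from preamble and postamble.
        return Feedback("NACK", rate_from_pre_post, False)
    # Loss due to interference: use the cleaner of preamble/postamble
    # and additionally ask the sender to back off.
    return Feedback("NACK", rate_from_cleaner_amble, True)


print(receiver_feedback(True, False, True, 36, 24, 18))
```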


5.2 Implementation 


AccuRate builds on the OFDM codebase for the 
USRP/GNU-Radio platform. We adopt the publicly 
available building blocks of SoftRate (like the BCJR 
decoder [27]) for building AccuRate. This facilitates a 
platform for fair comparison between the two. 802.11a/g
specified modulation schemes and channel coding rates
are used (Table 1) in an attempt to emulate 802.11-like
scenarios. The transmitter encodes the data using a 
standard rate-1/2 convolutional encoder, and applies 
puncturing to achieve varying code rates. The bandwidth 
is fixed at 20 MHz for GNURadio simulations and at 
2 MHz for testbed experiments. We have incorporated 
a Rayleigh fading channel simulator [11] into the 
GNURadio codebase. The OFDM implementation uses 
an FFT length of 1024, with 394 occupied tones, 8 pilot 
tones and a cyclic prefix of length 256. 


Figure 9 presents the block diagram for AccuRate’s
implementation in GNURadio. SoftRate is also shown as a
comparison point, especially because the two schemes
use very similar modules. In SoftRate, an incoming
packet is demodulated and passed through the BCJR
decoder. The output of the BCJR decoder comprises
the data bits and their respective confidence values.
These are passed through a BER computation module,
resulting in the actual packet and its single BER. The
SoftRate estimation algorithm runs in the “Select Rate”
module, which picks the packet’s rate by comparing the
BER against a BER-Rate relationship curve. The final
output, $R_{SoftRate}$, is SoftRate’s prescribed rate.


We note that AccuRate uses a similar module chain with 
a few augmentations. When the over-the-air packet ar- 
rives, AccuRate measures the symbol level dispersions 
from the demodulator and stores them in the Build
Dispersion Model module. This module uses the correctly
received packet to calculate the accurate per-symbol
dispersion⁴. In addition, a random packet is generated
and encoded at different rates ($R_1, R_2, \ldots, R_n$). Symbols
from each rate-encoded packet are then subjected to the 
recorded dispersions, and the output is passed through 
the demodulator. Although Figure 9 shows the opera- 
tions in parallel (incurring an additional hardware cost), 
we use a single chain in our GNURadio implementation 
and iterate over all possible rate encodings (imposing a 
higher processing latency). The output of the demodula- 
tor, denoted Demod, is fed into the BCJR decoder. The 
output bits are collected into a frame and checked for 
CRC (the confidence values are not used in AccuRate). 
If the CRC check passes at that rate, AccuRate deems the 
corresponding rate to be successful. AccuRate picks the 
maximum of all successful rates, and prescribes it for the 
subsequent transmission. 
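
The single-chain iteration over rate encodings amounts to the loop sketched below. The rate list mirrors the entries marked as implemented in Table 1; the callback `replay_passes_crc` stands in for the encode, disperse, demodulate, decode, and CRC steps and is a hypothetical placeholder, as is the choice to return `None` when no rate succeeds.

```python
from typing import Callable, Optional

# (modulation, code rate, 802.11 rate in Mbps) for the implemented rates.
RATES = [
    ("BPSK", "1/2", 6), ("BPSK", "3/4", 9),
    ("QPSK", "1/2", 12), ("QPSK", "3/4", 18),
    ("16QAM", "1/2", 24), ("16QAM", "3/4", 36),
    ("64QAM", "3/4", 54),
]


def prescribe_rate(replay_passes_crc: Callable[[str, str], bool]) -> Optional[int]:
    """Replay the recorded dispersions at every rate, one after another,
    and return the maximum rate whose replayed frame passes the CRC."""
    successful = [mbps for mod, code, mbps in RATES
                  if replay_passes_crc(mod, code)]
    return max(successful) if successful else None


# Stub outcome for illustration: only BPSK and QPSK encodings survive.
print(prescribe_rate(lambda mod, code: mod in ("BPSK", "QPSK")))
```

Running the candidates sequentially trades latency for hardware, which is the choice made in the GNURadio implementation described above.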


⁴ If the over-the-air packet failed, the dispersion sequence is suitably
built from the preamble and postamble only.


Figure 9: Block diagram of AccuRate


Table 1: 802.11 modulation and coding used in AccuRate
(20 MHz channel)


Modulation   Coding   802.11 rate   Implemented?
BPSK         1/2      6 Mbps        yes
BPSK         3/4      9 Mbps        yes
QPSK         1/2      12 Mbps       yes
QPSK         3/4      18 Mbps       yes
16QAM        1/2      24 Mbps       yes
16QAM        3/4      36 Mbps       yes
64QAM        2/3      48 Mbps       no
64QAM        3/4      54 Mbps       yes

6 Evaluation 


6.1 Methodology 


We faced two challenges while evaluating AccuRate. (1) 
The wireless channel changes over time making it diffi- 
cult to determine what could have been the optimal rate 
for a given transmission. (2) The high latency incurred 
in procuring RF samples from the USRP front-end 
makes it impractical to evaluate AccuRate in realtime. 
In view of these, we make two approximations in our 
experimentation methodology. First, we incorporate a 
Rayleigh fading simulator into the GNURadio codebase. 
The simulator [11] employs the same USRP/GNURadio
transmitter and receiver, but connects them through
a loopback configuration. Packets flow out of the


transmitter, and instead of advancing through the wire- 
less channel, they are made to flow through simulated 
channel conditions. The output of the channel simulator 
is presented to the receiver which then executes regular 
demodulation/decoding. Since the simulated channel 
conditions can be forced to remain unchanged, we 
are able to compare the optimal bit rate for a given 
transmission against those prescribed by AccuRate and 
other schemes. 


Our second approximation is designed to test the
performance of AccuRate over real wireless channels. To this
end, we repeatedly transmitted trains of packets, each
train comprising 7 short packets (each 200 bytes) at
increasing bit rates. Assuming that the channel is
coherent for the duration of the packet train, we determine
the optimal transmission rate $R^*$ by recording the highest
bit rate successful in that train. Now, AccuRate picks a
random packet in the first train and predicts the optimal rate
for that packet, $R^*_{AccuRate}$. The operation is performed
offline; the difference between $R^*$ and $R^*_{AccuRate}$
characterizes AccuRate’s rate selection accuracy. Moreover,
the packet corresponding to $R^*_{AccuRate}$ in the next train
is also selected, as if that were the transmission that
AccuRate would have executed. This packet’s transmission at
$R^*_{AccuRate}$ is then used to predict the subsequent
transmission rate, and so on. The throughput is computed
based on the success/failure of the packets selected in
each train. SoftRate’s performance is also compared in
these settings. Thus, while simulators provide faithful
comparisons under approximate channel models,
packet-train based evaluations attempt to achieve the converse.
We believe that together, these experiments provide a fair
comparison between AccuRate, SoftRate, and the
Optimal rate selection algorithms.
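
The offline packet-train procedure can be sketched as follows. Each train is represented as a list of booleans (one per rate level, in increasing order) recording whether the packet at that level was received; `predict` is a hypothetical callback standing in for the rate estimator under test, and the function names are our own.

```python
from typing import Callable, Optional


def optimal_level(train: list[bool]) -> Optional[int]:
    """Highest rate level that succeeded in this train, or None."""
    ok = [level for level, success in enumerate(train) if success]
    return max(ok) if ok else None


def evaluate(trains: list[list[bool]],
             predict: Callable[[list[bool], int], int],
             start_level: int = 0):
    """Walk the trains: in each train, take the packet at the currently
    selected level, let the estimator predict a level from it, compare
    the prediction with the train's optimal level, and select the
    predicted level in the next train.

    Returns (per-train differences in rate levels, packets delivered).
    """
    level, errors, delivered = start_level, [], 0
    for train in trains:
        delivered += 1 if train[level] else 0
        predicted = predict(train, level)
        best = optimal_level(train)
        if best is not None:
            errors.append(predicted - best)
        level = predicted
    return errors, delivered


# Illustration with an oracle estimator that always predicts the optimum.
trains = [[True, True, False], [True, False, False], [True, True, True]]
oracle = lambda train, level: optimal_level(train) or 0
print(evaluate(trains, oracle))
```

Negative differences correspond to underselection and positive ones to overselection; the delivered count is the basis for the throughput figure.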


6.2 Performance Results 


We have designed experiments to answer the following 
key questions about the performance of AccuRate. (1) 
What is AccuRate’s rate estimation accuracy compared 
to the optimal rate and other existing schemes? (2) 
How does the accuracy vary under different channel 
conditions? (3) How well does AccuRate discriminate 
between fading and interference? How does interference 
affect rate selection? (4) What is the accuracy of rate es- 
timation based on preamble and postamble dispersions? 


To understand AccuRate’s performance against existing 
schemes, we also evaluate SoftRate and SNR-based rate 
estimation. SNR-based rate uses the SNR feedback to 
pick the transmission bit rate. The SNR-rate relationship 
is derived a priori from a wide range of empirical mea- 
surements on USRP/GNURadios. 


Rate Selection Accuracy 


We evaluated the accuracy of rate estimation by Accu- 
Rate and other schemes in both slow fading (walking) 
and fast fading (driving) scenarios as described below. 


Slow Fading: We induced slow fading by moving 
the USRP transmitter on a wheeled chair, while it is 
transmitting to a fixed USRP receiver. We transmitted 
500 packet trains where each train has one packet per 
rate. We repeated this experiment 10 times, 
resulting in a total of 5000 packet transmissions 
for each bit rate. Figure 10(a) shows the results of 
these testbed experiments by plotting the CDF of the 
estimation accuracy. A negative value of the difference 
between estimated rate and optimal rate indicates under- 
selection and a positive value indicates overselection. 
AccuRate selects the optimal rate nearly 95% of the 
time, which is around 10% and 20% better than SoftRate 
and the SNR-based scheme, respectively. We also conducted 
simulations by setting the channel parameters to reflect 
slow fading. In this case, as shown in Figure 10(b), 
AccuRate is always optimal and again performs better 
than the other two schemes. Based on these results, we 
conclude that AccuRate estimates the rate with high 
accuracy under slow fading. 

To get a sense of how well the rate predicted by AccuRate 
tracks the optimal rate in a time varying channel, 
we take a closer look at AccuRate rate selection. We 
plot the AccuRate rate and optimal rate at each point for 
a 300 train snapshot in Figure 11. Clearly, AccuRate 
tracks the optimal rate curve reasonably well. 

Figure 10: Rate selection accuracy under slow-fading 
mobility: (a) testbed vs. (b) simulation results. The 
X-axis shows the difference in discrete rate levels. 


Fast Fading: Doppler effects at vehicular speeds cause 
fast fading in wireless channels. We examine AccuRate’s 
performance by simulating such conditions in the 
Rayleigh fading channel simulator for GNURadio [11]. 
This simulator implements detailed channel models 
including multipath. The inputs (and outputs) to this 
simulator are drawn from (and sent to) GNURadio. The 
system parameters are configured to emulate various 
channel coherence conditions. Doppler shift is varied 
from 400 Hz to 4 kHz, translating to channel coherence 
times of 1 ms to 100 µs. This captures the range of 
mobile channel conditions. We sent 25000 packets of 
size 700 bytes for each Doppler Shift. The channel is 
replayed for every scheme for performance comparison. 
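
The Doppler shifts and coherence times quoted above are consistent with the common rule of thumb T_c ≈ k / f_d. The short sketch below assumes k = 0.4, chosen here only because it reproduces the quoted endpoints; the text does not state which constant the simulator uses, and `coherence_time_s` is a hypothetical helper name.

```python
def coherence_time_s(doppler_hz: float, k: float = 0.4) -> float:
    """Approximate channel coherence time (seconds) for a Doppler shift.

    Uses the rule of thumb T_c ~= k / f_d. The default k = 0.4 is an
    assumption picked to match the 400 Hz -> 1 ms and 4 kHz -> 100 us
    endpoints quoted in the text.
    """
    if doppler_hz <= 0:
        raise ValueError("Doppler shift must be positive")
    return k / doppler_hz


if __name__ == "__main__":
    for f_d in (400.0, 4000.0):
        print(f"{f_d:.0f} Hz -> {coherence_time_s(f_d) * 1e6:.0f} us")
```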


We present the rate selection accuracy for each scheme 
under varying channel coherence time in Table 3. We 
show only the accurate and over-selection percentages 
and omit the under-selection percentages (which can 
be inferred as they total 100%) for clarity. SNR-based 
schemes underestimate or overestimate the rate in 
around 40% of the cases when the coherence time is 
less than 1 ms. This is due to the changes in SNR-BER 
relationship with the change in coherence time. Also, 
SNR is calculated only during the preamble which does 
not capture the entire packet duration. The accuracy of 
AccuRate and SoftRate remains relatively consistent 
across different coherence times, though AccuRate still 
outperforms SoftRate by around 12%. This is an effect 
of fast-changing channel conditions, requiring a rate 
estimation scheme to jump multiple levels in one step. 
AccuRate executes these jumps effectively. 


                         Coherence Time 
            1 ms           500 µs         200 µs         100 µs 
Scheme      Acc.   Over    Acc.   Over    Acc.   Over    Acc.   Over 
AccuRate    98%    1%      98%    1%      97%    2%      95%    3.1% 
SoftRate    83%    0%      86%    6%      78%    4%      80%    14% 
SNR         79%    14%     57%    21%     60%    24%     54%    18% 

(Acc. = Accuracy, Over = Over-Select) 

Table 2: Rate selection accuracy under various fading conditions (simulated). 

Figure 11: Close-up of AccuRate rate selection under time varying channel. 


Interference Detection and Rate Selection: Rate 
estimation under interference is a challenging problem. 
A receiver must first detect that there is interference. It 
should then estimate the dispersion due to fading alone 
to determine the best rate for the packet under fading. 
Existing schemes have focused on discriminating be- 
tween fading and interference, and have proposed to 
backoff when losses are due to interference. AccuRate 
tries to characterize fading even in case of interference 
losses [1], and account for fading alone in rate prescrip- 
tion. To evaluate these capabilities, we first evaluate 
AccuRate’s accuracy in detecting interference, followed 
by rate selection under both interference and fading. 
Interference Detection Accuracy 

In our experiment, we varied the position and power of 
the transmitter and interferer to obtain various realistic 
topologies. As a result, the SINR varies from 0 to 
12 dB. We ensured that the primary link has high 
packet delivery probability (> 0.9) in the absence of 
interference. Now, under interference, we considered 
the packets that failed the CRC check. We compute the 
symbol dispersion distributions D^pre and D^post for the 
preamble and postamble of the CRC-failed packet. We 
attribute the loss to interference if these distributions 
are not “similar”. Two distributions are declared similar 
if more than 50% of the samples of one distribution fall 
within the three-sigma limits of the other distribution (we 
model the dispersions with a Gaussian distribution). 
Figure 12 presents the detection accuracy results for 
varying transmission rates. AccuRate’s interference 
detection accuracy is slightly lower than SoftRate for 
low bit rates. This is because low bit rates can tolerate 
high dispersion and therefore dispersions of preamble 
and postamble tend to be similar even in case of a 
loss with interference. On the other hand, AccuRate 
performs much better than SoftRate at higher rates and 
accurately diagnoses loss in more than 95% of cases. 
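
The similarity test used for interference detection can be sketched as follows. This is a minimal illustration under stated assumptions, not the receiver implementation: dispersions are taken to be plain lists of floats, each distribution is summarized by its sample mean and standard deviation (the Gaussian model), and the test is treated as passing if the 50% condition holds in either direction, which is one possible reading of the description. All function names are hypothetical.

```python
import statistics
from typing import Sequence


def fraction_within_3sigma(samples: Sequence[float],
                           reference: Sequence[float]) -> float:
    """Fraction of `samples` inside the 3-sigma limits of `reference`."""
    mu = statistics.fmean(reference)
    sigma = statistics.pstdev(reference)
    lo, hi = mu - 3.0 * sigma, mu + 3.0 * sigma
    inside = sum(1 for s in samples if lo <= s <= hi)
    return inside / len(samples)


def is_similar(a: Sequence[float], b: Sequence[float],
               threshold: float = 0.5) -> bool:
    """True if more than `threshold` of one set falls within the other's
    3-sigma limits (checked in either direction; an assumption)."""
    return (fraction_within_3sigma(a, b) > threshold
            or fraction_within_3sigma(b, a) > threshold)


def loss_due_to_interference(pre_dispersion: Sequence[float],
                             post_dispersion: Sequence[float]) -> bool:
    """Attribute a CRC failure to interference when the preamble and
    postamble dispersion distributions are not similar."""
    return not is_similar(pre_dispersion, post_dispersion)
```

With a preamble whose dispersions cluster near 0.1 and a postamble whose dispersions cluster near 1.0, neither set falls within the other's limits, so the loss would be attributed to interference.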


Figure 13: Rate prescription accuracy in the presence of 
interference: (a) SoftRate; (b) AccuRate. Panels: 
Testbed-SoftRate and Testbed-AccuRate. Y-axis: Fraction of 
Lost Packets (Overselect, Accurate, Underselect). X-axis: 
Bit Rate (BPSK 1/2, BPSK 3/4, QPSK 1/2, QPSK 3/4, 
QAM16 1/2, QAM16 3/4). 


Rate Estimation Accuracy with Interference 

When interference is detected by AccuRate, it estimates 
the rate supported by the channel under fading alone 
based on the lower of the dispersion values among 
preamble and postamble. Fig. 13 compares the rates 
prescribed by SoftRate and AccuRate with the optimal 
rates. In the presence of interference, SoftRate’s estima- 
tion of the rate supported by the channel is not optimal 
in 20% to 30% of the cases. Whenever SoftRate fails 
to identify interference, it computes a conservative rate. 
AccuRate does not perform well at lower rates either (as 
explained above), but is still better than SoftRate. On 
the other hand, AccuRate performs quite well at higher 
transmission rates, prescribing the best rate in more than 
92% of cases. These results show that AccuRate is quite 
robust under varying channel conditions with slow/fast 
fading and with/without interference. 
Throughput 


The effect of accurate rate selection on the link’s 
throughput is of interest. Figure 14 compares the 
throughput between AccuRate, SoftRate, and SNR- 
based rate estimation. The simulation results in Figure 
14(a) are obtained for varying channel coherence times. 
As expected, all schemes suffer performance degrada- 
tion with shorter coherence times. However, AccuRate’s 
ability to pick the optimal rate from correctly received 
packets permits the subsequent packet to succeed as 
well. This is a positive feedback that results in good 
performance, particularly because a large fraction of the 
packets are received correctly. SoftRate outperforms 
SNR-based rate selection, but still remains below the 
AccuRate throughput. 


























Figure 14: Throughput comparison under (a) simulated 
slow fading channels, (b) walking experiments on the 
USRP/GNURadio testbed. Panel (a), Simulation: throughput 
of AccuRate, SoftRate, and SNR-based selection versus 
Channel Coherence Time. Panel (b), Testbed: throughput of 
the same three schemes versus Walking Trace Number. 

Figure 14(b) shows the throughput comparison from 
testbed experiments (the receiver was moved with walk- 
ing speeds). We briefly summarize the experiment 
methodology here. Recall that 7 short back-to-back 
packets (called a packet-train) are being repeatedly trans- 
mitted to determine the optimal rate during each train. 
For a packet-train T_i, say packet j’s rate was estimated 
by AccuRate, denoted as R_i. Now, for packet-train 
T_{i+1}, the short packet that was transmitted at rate R_i is 
selected. Observe that, when running as a full system, 
this is the packet that AccuRate would have transmitted. 
Now, if R_i was an incorrect estimate, this packet 
in T_{i+1} would be received in error, implying that 
AccuRate would have to make the next estimate based on 
this erroneous packet. Continuing this process, we 
compute the number of packets successfully received at the 
receiver, and the total time incurred for their 
transmissions. The throughput for SoftRate and SNR-based 
rate selection is also computed as above. The Optimal 
throughput is computed based on the highest achievable rate 
in each packet-train. Evidently, AccuRate consistently 
outperforms SoftRate and SNR, except on a few occasions 
where the performances are comparable. On average, 
AccuRate achieves 87% of the optimal throughput possible 
while SoftRate accomplishes 75%. We summarize 
by observing that SoftRate leaves little room for 
improvement, and AccuRate makes that room even smaller. 
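
The trace-driven throughput computation described above can be sketched as a simple loop over recorded packet trains. This is a hedged illustration, not the authors' tooling: the `estimator` callback stands in for any rate estimation scheme, `oracle_estimator` is a hypothetical stand-in that simply reads the train outcome (it is not AccuRate), the rate values in the usage note are made up, and airtime is approximated as payload bits divided by bit rate with all MAC and PHY overheads ignored.

```python
from typing import Callable, Sequence

# estimator(rate_index_used, received_ok, train_outcomes) -> next rate index
Estimator = Callable[[int, bool, Sequence[bool]], int]


def emulate_throughput(trains: Sequence[Sequence[bool]],
                       rates_bps: Sequence[float],
                       estimator: Estimator,
                       payload_bits: int,
                       start_rate: int = 0) -> float:
    """Replay recorded packet trains and return throughput in bits/s.

    trains[i][r] is True if the packet sent at rate index r in train i
    was received correctly. In each train, only the packet sent at the
    currently selected rate is counted, as if the scheme under test had
    transmitted it; the estimator then picks the rate for the next train.
    """
    rate = start_rate
    delivered_bits = 0
    airtime_s = 0.0
    for train in trains:
        ok = train[rate]
        airtime_s += payload_bits / rates_bps[rate]
        if ok:
            delivered_bits += payload_bits
        rate = estimator(rate, ok, train)
    return delivered_bits / airtime_s if airtime_s > 0 else 0.0


def oracle_estimator(rate: int, ok: bool, train: Sequence[bool]) -> int:
    """Hypothetical stand-in: highest rate that succeeded in this train."""
    best = 0
    for idx, success in enumerate(train):
        if success:
            best = idx
    return best
```

For example, with three trains, rates of 6, 12, and 24 Mbps, and 200-byte (1600-bit) packets, `emulate_throughput(trains, [6e6, 12e6, 24e6], oracle_estimator, 1600)` counts one selected packet per train and divides the delivered bits by the accumulated airtime.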


Efficacy of Pre/Postamble based Models 


When packets are received erroneously, recall that Ac- 
cuRate uses only the known preamble and postamble to 
model the channel-induced dispersion. This is clearly 
an approximation and will cause sub-optimal rate esti- 
mation. Moreover, the postamble is an additional over- 
head, and hence, reducing its size is of interest. Fig- 
ure 15 illustrates the performance of rate estimation with 
preambles and two different-sized postambles. In the 
simulation results (Figure 15(a)), the performance with 
preamble alone achieves around 80% accuracy, 12% rate 
over-selection, and 8% under-selection. Including half 
the postamble improves the accuracy to around 89%, 
while the preamble and the postamble together can offer 
nearly 98% accuracy. Of course, in the testbed results 
(Figure 15(b)), the rate estimation accuracy degrades be- 
cause the pre/postambles may not always capture the dy- 
namism of the wireless channel. Thus, while estimating 
the rate of the received packet, AccuRate may be opti- 
mistic about the channel fluctuations, thereby selecting 
higher rates. However, we still find that the degradation 
is slight. When the preamble and postamble are both 
used, rate over-selection is around 5%, and accuracy is 
94%. We note that the testbed experiments are performed 
for walking scenarios. We believe that in exchange for 
the postamble overhead, a 5% over-selection and a 94% 
rate estimation accuracy is a decent tradeoff. Note that 
with correct reception of the packet, rate estimation ac- 
curacy increases further. 


7 Deficiencies, Ongoing Work 


AccuRate is promising but not yet ready for full-scale 
deployment. We discuss some of the deficiencies and 
directions of ongoing work. 
Figure 15: Efficacy of pre/postamble based dispersion 
models for (a) simulation, and (b) testbed. Simulations 
configured with slow fading channels, while testbed re- 
sults are from walking experiments. 


(1) Pre/Postambles produce inaccurate channel 
modeling. When a received packet fails the CRC 
check, AccuRate extracts only the (BPSK modulated) 
preamble and postamble to model the channel-induced 
dispersions. These dispersions are replicated to form 
a packet-long dispersion sequence, and “replayed” on 
packets at other rates. Clearly, this is an approximation 
and does not capture channel variations that may have 
occurred between the preamble and the postamble. 
Naturally, AccuRate’s selection accuracy deviates from 
the optimal. One way to address this issue could be 
to introduce midambles in the packets, i.e., known 
symbols that are interspersed with the actual data 
symbols. Midambles will offer additional “glimpses” 
into the channel’s behavior, allowing for a better channel 
dispersion model. We are investigating the potential 
benefits of midambles as a part of our ongoing work. 


(2) Overhead of ambles. Postambles and midambles 
are overheads introduced by AccuRate. This overhead 
can be viewed as the price of improved rate estimation 
accuracy (for erroneously received packets). If this 
overhead is deemed unacceptable (perhaps for shorter 
packets), we plan to test a few other ideas. First, the 4 
pilot tones used for equalization in each OFDM symbol 
may be used to replace the post/midambles. These tones 
are known, and may serve AccuRate well for estimating 
dispersions. Second, we observe that SoftRate does 
not rely on post/midambles; instead it utilizes the 
confidence values of all (correct or incorrect) symbols. 
We envisage that when the packet has failed, SoftRate 
could be triggered to pick reasonably good rates. When 
the packet is received correctly, AccuRate could predict 
the optimal rate. Such a fusion of the two schemes is 
likely to be better than either one alone. 


(3) Implementation complexity. Our primary focus 
while designing AccuRate has been on estimating the 
optimal bit rate, unconstrained by the complexity and 
cost of its implementation. In practice, the hardware 
cost and the implementation complexity will be high. 
We intend to optimize for these factors in our future 
work. However, even if cost and complexity can be side- 
stepped, AccuRate will still need to meet the latency 
constraints of IEEE 802.11 (i.e., the rate estimation 
process must complete within the SIFS time window of 9 µs). 
This may be a concern even when implemented in 
hardware. However, we observe that several components 
of AccuRate, as organized in Figure 9, are amenable 
to pipelining and speculative operation. For instance, 
while the symbols are being received, one may speculate 
correct packet reception and form the dispersion vector 
from the already-received symbols. This dispersion 
vector can be “replayed” on a random packet, and hence, 
the “replay” operation can be pipelined with the actual 
over-the-air reception. Since this packet is assumed to 
be correct, replaying needs to be performed only for bit 
rates that are higher than the packet’s actual transmission 
bit rate. The replay operation can easily be performed at 
least as fast as the actual reception (potentially using the 
same hardware), and hence, the SIFS constraint can be 
met for correctly received packets. 


To account for the case of packet failure, the dispersions 
can be modeled from the preamble alone, and replayed 
in a parallel hardware pipeline. This will also meet the 
timing constraints, but at the expense of a less accurate 
dispersion model. While using the postamble will 
improve this model, its implementation may violate 
timing constraints because the receiver will have to 
wait till the end of packet reception to perform the 
replay. To address this concern, we envisage trading 
off hardware cost, interference-detection accuracy, or 
per-packet overhead. (1) By incurring a greater cost, the 
receiver could incorporate a “bank” of replay chains. 
Once the postamble arrives at the end of the packet, the 
receiver could model the dispersion, and replay it 
concurrently on symbols from the tail of the packet. 
(2) If this hardware cost is unacceptable, an alternative 
could be to move the postamble earlier in the packet. 
An earlier postamble will model the dispersion in time, 
which can then be replayed on the symbols arriving 
over-the-air. At the risk of not detecting interference 
that arrives during the tail end of the packet, the early 
postamble may reduce hardware cost and meet the 
desired time constraints. (3) Another alternative could 
be to include a midamble in addition to the postamble, 
which elongates the packet but aids in both accurate and 
timely rate estimation, and interference detection. 


In summary, even if network processors operate at the 
same speed as wireless reception, it may be possible to 
meet timing constraints through speculation and pipelin- 
ing. Depending on the outcome of the final CRC check, 
the corresponding replay thread can be used. Of course, 
if processor speed exceeds that of wireless reception, the 
cost, complexity, and accuracy are likely to improve in 
favor of AccuRate. A more careful investigation of this 
space is a topic of future research. 


8 Conclusion 


This paper asks the question: for any received packet, can 
we determine the optimal rate at which the packet should 
have been transmitted? This information is valuable 
because the optimal rate can help the link layer with 
impending rate selection. In an attempt to answer this 
question, we propose AccuRate, a constellation based rate 
estimation scheme. AccuRate exploits symbol level 
information to characterize the channel’s distortion on the 
incoming packet, and then “replays” this distortion on 
other rate-encodings of the same packet. The maximum 
rate that succeeds is deemed the optimal rate. When 
such a retrospective method is used to decide on impending 
transmission rates, we find that AccuRate achieves 
higher throughput than SoftRate. The performance is often 
close to the optimal when the time-separation between 
packets is small and the transmissions are in static or 
slow-moving scenarios. Our ongoing work is simultaneously 
focused on extending AccuRate to high mobility 
environments, while also making the protocol viable in 
terms of hardware cost and complexity. 
Abstract 


Current WiFi Access Points (APs) choose transmission 
parameters when emitting wireless packets based solely 
on channel conditions. In this work we explore the 
benefits of deciding packet transmission parameters in a 
content-dependent manner. We demonstrate the benefits 
specifically for media delivery applications in WiFi en- 
vironments by designing, implementing and evaluating a 
system, called Medusa. In order to keep the APs rela- 
tively simple, we implement the Medusa functions in a 
media-aware proxy. More specifically, when forwarding 
our media traffic, Medusa requires that APs simply use 
the WiFi broadcast feature, and that they refrain from 
making decisions on which wireless packets to retrans- 
mit, or what PHY rates such packets should be trans- 
mitted at. Instead we combine these typical link layer 
functions with a few other content-specific choices, in 
the proxy. Through detailed experiments across diverse 
mobility and interference conditions we demonstrate the 
advantages of this scheme for both unicast and multicast 
media delivery applications. The advantages are partic- 
ularly substantial in multicast scenarios, where Medusa 
was able to deliver a 20 Mbps HD video stream simul- 
taneously to 25 clients, using a single 802.11 AP, with 
good to excellent PSNR. 


1 Introduction 

Robust delivery of rich media content over wireless links 
is an increasingly important service today. As more 
and more high quality media content becomes available 
through the Internet, user expectations of accessing 
such content over their wireless enabled devices continue 
to grow. Examples include users watching on-demand 
shows and movies from sources such as Hulu and Net- 
Flix, students in a university campus interested in fol- 
lowing online lectures while sitting in the cafeteria, and 
employees in a company watching a company-wide pre- 
sentation by their CEO, whether at home, at work, or 
while sitting in a coffee shop. While the widely deployed 
WiFi technology can provide adequate performance for 
relatively lower quality media streams, the user experi- 
ence when watching high quality streams (e.g., HD qual- 
ity content) leaves much to be desired. In this paper, we 
attempt to push the envelope of media delivery perfor- 
mance for WiFi systems, by exploiting some knowledge 
of media content in making transmission parameter se- 
lection at the wireless transmitter (APs). While our pro- 
posed approach applies equally to unicast as well as to 


multicast scenarios, the biggest advantage of the system 
arises in the multicast case where multiple users are in- 
terested in the same content. 

A target campus application and challenges: The IT 
department of the UW-Madison campus is interested in 
providing high quality broadcast of specific educational 
content through its intranet. Such capability would allow 
students sitting in dormitory rooms, in union buildings, 
in cafeterias, and in libraries, to follow the classroom. 
Further, the system would also allow easy and conve- 
nient dissemination of live guest lectures from remote 
locations, without requiring the guest lecturer to visit the 
campus. While the wired backhaul has sufficient capac- 
ity to carry such media traffic, initial experiments by the 
IT department revealed obvious performance problems 
on the WiFi based last hop. In particular, they observed 
that even if 3 users connected to a single 802.11g AP 
attempted to watch the same HD video stream, the 
performance of the system was abysmally poor¹. 

The poor performance of media delivery over WiFi 
for multiple users requesting the same content is a 
combined effect of three factors: (i) HD quality video places 
a high bandwidth demand on the wireless medium; 
commercial HD encoders, such as the Streambox 
SBT3-9200 [6], create content with data rates ranging 
from 512 Kbps to 30 Mbps, (ii) WiFi typically employs 
802.11 unicast mode for sending similar content 
separately to each user, and (iii) the WiFi transmitter makes 
various configuration choices, e.g., PHY transmission 
rates, number of re-transmission attempts, etc., for each 
wireless packet, without any knowledge of the relevance 
of the packet’s contents to end applications. In this 
paper, we design and implement a system called Medusa, 
short for Media delivery using adaptive (pseudo)-broadcasts, 
that can efficiently address issues (ii) and (iii). 


1.1 Medusa approach 


The 802.11 transmitter typically transmits packets in 
FIFO order and makes multiple decisions for each 
wireless packet transmission. These include channel 
contention, i.e., when to attempt wireless transmission, 
selection of the PHY rate for the packet, and the number 
of re-transmission attempts to make in case of failures. 
In this paper we contend that many of these decisions, 
namely PHY rate selection, number of re-transmission 
attempts for each packet, and the order of packet 
transmissions, are better made by taking the “value” of the 
wireless packet to the application into account as well. 
Let us consider MPEG-encoded [4] video content that 
consists of I-, P-, and B- video frames. Given that a 
packet carrying I-bits is more important, the PHY rate 
of such a packet can be picked more conservatively than 
by value-unaware rate adaptation algorithms, e.g., 
SampleRate [8]. This would ensure that the loss probability 
of packets carrying I-bits is particularly low. Similarly, 
if the wireless channel capacity is scarce and packet 
errors are high, then it is more important to devote greater 
re-transmission effort to packets carrying I-bits than 
to packets carrying P- and B-bits. 

¹ An 802.11n AP can potentially scale performance to up to 
10-12 users watching such HD content simultaneously. The 
typical number of users in busy parts of campus is often 
much higher. 

Hence in Medusa, we offload these decisions for our 
media traffic to a media-aware proxy that can inter- 
pret the value of the data. More specifically, APs in 
Medusa no longer perform link-layer re-transmissions or 
PHY rate selections. Instead, the proxy examines the 
value of each packet to applications, and instructs APs to 
(re)transmit these packets in a certain order and at a 
specified PHY rate. 


Prior work, e.g., Trantor [21], has considered a model 
in which a centralized controller decides transmission 
parameters, e.g., PHY rates, transmit power, etc. for dif- 
ferent APs and clients in an entire enterprise WLAN. At 
a high level, our proposal of proxy-based PHY rate se- 
lection and re-transmissions may appear similar. How- 
ever, there is a fundamental difference between the two 
proposals. Trantor suggests a centralized rate selec- 
tion in a content-agnostic manner, based solely on po- 
tential interferences between different conflicting wire- 
less links. The approach in Medusa augments this de- 
cision by incorporating knowledge about value of the 
packet contents to applications in deciding the PHY rate 
of packets, the order in which they should be transmit- 
ted, and the number of re-transmission attempts to be 
made. It also optionally utilizes simple network coding 
approaches [23] to improve efficiency. 

Unicast vs broadcast vs pseudo-broadcast: In a sce- 
nario where multiple users are requesting the same con- 
tent (the same media stream, for example), we advocate 
the use of 802.11 standard’s broadcast mode of opera- 
tion. The choice is motivated by the observation that 
an 802.11 broadcast packet can be used to communicate 
content simultaneously to all receivers. This would sub- 
stantially reduce the load on the wireless medium. How- 
ever, MAC-layer broadcast packets do not elicit MAC- 
layer acknowledgments from receivers, leaving the trans- 
mitter unaware of losses on the wireless channel. This is 
a problem for any broadcast-based wireless system, as 
the transmitter can no longer decide which packets 
require re-transmissions. Further, absence of loss 
information makes it impossible to determine an appropriate 
PHY rate adaptation mechanism. Hence, we incorporate 
higher layer acknowledgments from the clients to 
the proxy, that help the latter in making these decisions 
for the APs in a content-dependent manner. 

Figure 1: Schematic of the Medusa shared media delivery 
system. 


In a single user scenario, we could use 802.11 unicast 
transmissions. However, even in such settings, it is pos- 
sible to exploit ER-style network coding opportunities 
when transmitting wireless packets [23]. Such network 
coded packets, by definition, have multiple intended re- 
cipients. Broadcast-based 802.11 packets are suitable to 
facilitate this enhancement as well. 


Lack of MAC-layer acknowledgments is problematic, 
however, for one more link layer decision that has to 
be taken by the APs: channel contention. The lack of 
acknowledgments indicating packet losses prevents APs 
from inferring how the MAC layer backoff counters 
should be adjusted. Given the lack of MAC-layer ac- 
knowledgments for broadcast packets, the existing meth- 
ods for channel contention are likely to fail. Hence, we 
use pseudo-broadcast packets in Medusa to communi- 
cate all broadcast media content, analogous to what was 
proposed in COPE to deliver network coded multicast 
data [18]. In pseudo-broadcast, 802.11 unicast mode is 
used where one of the receivers is picked at random to 
be the explicitly stated (unicast) recipient, while other 
intended recipients simply overhear the packet. MAC- 
layer acknowledgments arrive from the explicit recipient, 
and backoff parameters can be appropriately adjusted. 
Note that these MAC-layer acknowledgments are inter- 
preted by the APs solely for the backoff adjustment func- 
tion, and not used by the proxy to decide PHY rate, trans- 
mission order, or eligibility for re-transmission of pack- 
ets. 
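
The pseudo-broadcast addressing step described above can be sketched as follows. This is a minimal illustration with hypothetical names (`pick_pseudo_broadcast_recipient` and the dictionary layout are assumptions, not part of Medusa's or COPE's actual interfaces): one intended receiver is chosen at random as the explicit unicast destination, and the remaining receivers are expected to overhear the frame.

```python
import random
from typing import Dict, List, Optional, Sequence


def pick_pseudo_broadcast_recipient(
        receivers: Sequence[str],
        rng: Optional[random.Random] = None) -> Dict[str, object]:
    """Choose the explicit unicast recipient for a pseudo-broadcast frame.

    The explicit recipient is the only station that returns a MAC-layer
    acknowledgment; every other intended receiver overhears the packet.
    """
    if not receivers:
        raise ValueError("at least one intended receiver is required")
    rng = rng or random.Random()
    explicit = rng.choice(list(receivers))
    overhearing: List[str] = [r for r in receivers if r != explicit]
    return {"unicast_dst": explicit, "overhearing": overhearing}
```

Because the explicit recipient still acknowledges the frame, the AP's backoff logic keeps working, while loss reports for all other receivers have to come from higher-layer feedback.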


Medusa system overview: Figure 1 shows a 
schematic of different aspects of the Medusa system, 
including an unchanged media server, a media-aware 
Medusa proxy, and APs and clients with minor software- 
level changes. The Medusa proxy intercepts all IP pack- 
ets corresponding to various video frames and relays 
them to the AP for transmission. The proxy instructs 
the AP to use 802.11 pseudo-broadcast for each wireless 
packet (irrespective of whether it is part of a multicast 
or a unicast stream) and also informs the AP what spe- 
cific PHY rate to transmit the packet at. The Medusa 
proxy makes these decisions, using a combination of 
four mechanisms: (i) WiFi reception reports: Each client 
provides a periodic reception report to the proxy about 
various wireless packets that it did or did not receive. 
While analogous to Reception Reports in RTCP [24], 
the reception reports in Medusa differ from those in 
RTCP in the detailed MAC layer information that is car- 
ried in Medusa for rate adaptation and re-transmission 
purposes. (ii) Estimating the value of a packet to me- 
dia applications: Not all packets are of equal impor- 
tance to receivers. When the wireless channel capacity 
is scarce, the value of each packet determines the choice 
of PHY rate and the number of re-transmission attempts 
to be made to deliver the packet. We use a simple per 
packet value assignment function at the Medusa proxy 
to determine the priority level of each packet encapsu- 
lating a media stream. (iii) PHY rate adaptation and re- 
transmissions for broadcast packets: We design a PHY 
rate adaptation and re-transmission strategy for broad- 
cast wireless packets that cannot depend on MAC-layer 
acknowledgments. Our rate adaptation scheme is two- 
paced. A conservative baseline PHY rate is identified at 
a slow timescale, and an individual PHY rate for each 
packet is chosen subsequently using an algorithm called 
Inflate-Deflate. (iv) Packet order selection and network 
coded re-transmissions: We modify the ordering of me- 
dia packets from traditional FIFO, especially when the 
channel quality is poor. Under bad channel conditions, 
it is beneficial to prioritize packet transmission based 
on packet “value”. This would increase the probability 
of successful reception of important packets. Further, 
the transmission order is also selected such that there 
are proxy-initiated re-transmission opportunities for the 
more important packets. During re-transmissions, we 
also leverage the gains of ER-style network coding. 
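
The value-based ordering idea in mechanism (iv) can be sketched as below. This is a hedged illustration only: the overview does not give the actual value assignment function, so the numeric weights per frame type, the `Packet` record, and the `channel_poor` flag are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical per-frame-type values; the real assignment is not given here.
FRAME_VALUE = {"I": 3, "P": 2, "B": 1}


@dataclass(frozen=True)
class Packet:
    seq: int          # arrival (FIFO) sequence number at the proxy
    frame_type: str   # "I", "P", or "B"


def transmission_order(packets: Sequence[Packet],
                       channel_poor: bool) -> List[Packet]:
    """Return packets in FIFO order on a good channel, and in decreasing
    value (ties broken by arrival order) when the channel is poor."""
    if not channel_poor:
        return sorted(packets, key=lambda p: p.seq)
    return sorted(packets, key=lambda p: (-FRAME_VALUE[p.frame_type], p.seq))
```

Sending the most valuable packets first also leaves time for proxy-initiated re-transmissions of those packets before their playback deadline, which is the motivation given above.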

Key contributions: In summary, the key contributions 
of this work are two-fold: 


(i) We propose an intuitive design of a pseudo-broadcast 
based WiFi system for media delivery at high 
video rates. The design incorporates various aspects of 
rate adaptation, re-transmission, and packet priorities, 
coupled with higher-layer feedback. 


(ii) We integrate all of these ideas together into a functional
Medusa system and present a detailed evaluation of
this system. Our results show that Medusa provides robust,
high-bandwidth (up to 20 Mbps), HD quality media
delivery to tens of co-located WiFi users interested in
the same content, all sharing the same WiFi AP, across a
range of channel conditions and mobility scenarios. Our
technique is applicable to unicast media delivery scenarios
as well.


2 Medusa design overview 


We describe the design of Medusa by using the exam- 
ple of MPEG4-encoded [4] video delivery over wireless. 
In MPEG4 the video content is partitioned into independently
decodable sequences of pictures, each called a Group
of Pictures (GOP). Each GOP has frames of three different
types: I, P and B. Each frame, in turn, is broken
down into multiple packets, which are then transmitted
over the wireless channel.

At the receiver, the I frames can be correctly decoded, 
as long as all constituent packets are received. For decoding
P frames, the successful reception of the previous I or P
frame in sequence is also necessary. Finally, to decode a
B frame, the previous as well as the next I or P frame in
sequence are needed. Put another way, an I frame does 
not depend on any other frame, while P frames depend on 
another frame and B frames depend on two other frames. 

As described, Medusa involves an unchanged media
server, a media-aware proxy, and APs and clients with
small software modifications. The only change in the
AP is to add a single piece of functionality: for each packet
forwarded by the proxy to the AP, the proxy should be
able to specify the PHY rate of transmission. The AP
simply accepts each such packet (which can be an original
packet, a previously transmitted packet, or a network
coded packet) and transmits it in pseudo-broadcast
mode at the PHY rate specified by the
proxy. The AP continues to perform back-off decisions
independently based on MAC-layer acknowledgments
received for its pseudo-broadcasts. The client is modified
to incorporate WiFi reception reports targeted to the
proxy. These reception reports include a higher-layer
acknowledgment (ACK) bitmap and are sent infrequently
by the clients, roughly once every 100 packets or 100 ms.
Since the proxy knows the PHY rates at which different 
packets were transmitted, it can use these reception re- 
ports to infer packet losses observed by different clients 
at different PHY rates. 
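
As an illustration of how the proxy might turn one reception report into per-rate loss estimates, the following Python sketch assumes a report made of a starting sequence number plus an ACK bitmap. The report layout and the names `loss_by_rate`, `sent_rate`, `base_seq` and `ack_bits` are hypothetical and are not taken from the Medusa implementation.

```python
from collections import defaultdict
from typing import Dict, List


def loss_by_rate(sent_rate: Dict[int, int], base_seq: int,
                 ack_bits: List[int]) -> Dict[int, float]:
    """Return the observed loss fraction per PHY rate for one client's report.

    sent_rate maps sequence number -> PHY rate (Mbps) the proxy chose.
    ack_bits[i] is 1 if packet base_seq + i was received, 0 otherwise
    (this report layout is an assumption made for the sketch).
    """
    lost = defaultdict(int)
    total = defaultdict(int)
    for offset, bit in enumerate(ack_bits):
        rate = sent_rate.get(base_seq + offset)
        if rate is None:          # sequence number the proxy never sent
            continue
        total[rate] += 1
        lost[rate] += 1 - bit     # bit 1 = received, 0 = missing
    return {rate: lost[rate] / total[rate] for rate in total}


# Four packets sent at three different rates; the client missed the last two.
sent = {100: 6, 101: 24, 102: 24, 103: 54}
report = loss_by_rate(sent, 100, [1, 1, 0, 0])
```

Because the proxy records the rate of every transmitted packet, a single infrequent report is enough to update the error estimate of several PHY rates at once.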

Based on this simple setup, the design problem of 
Medusa can be stated as, 

For a set of k video packets (including both original 
packets and packets that need re-transmissions), deter- 
mine the order of transmission and PHY rates for the 
packets, and pass this information along with the pack- 
ets to the APs. Further, determine whether some of these 
packets should be network coded, and whether some of 
them should be discarded. 

We present a particular solution to the above problem 
in the rest of this section that exploits knowledge of the 
value of each packet to applications. 
While we describe our scheme for MPEG4 video, our
approach generalizes to any other media encoding, where 
the content is structured in layers, and there are different 
levels of priority (value) for each layer. 


2.1 Determining value of packets 


As mentioned above, successful decoding of a video frame
might need reception of other video frames. Hence, all
else being equal, the value of each video packet depends
on how many other packets (or bytes) depend on this
packet for correct decoding of various video frames. Naturally,
I-frame packets become more important than P- or
B-frame packets. The value of a video packet is also influenced
by its impending playback deadline and that of
other dependent video frames. Packets for video frames
that are approaching their display deadline are more important.
Finally, given that many packets are relevant to
more than one client (true for original, re-transmitted,
and network coded packets), the value of a packet should
also grow with the number of intended recipients.

Previous research on video encoding for streaming [9,
12, 19, 26] has proposed LP-based techniques for determining
the priority of video frames. Such techniques
typically utilize a directed acyclic graph (DAG) of video
frame dependencies along with an (empirical or theoretical)
channel error model to determine relative value for
frames.

While such sophisticated designs of packet value can 
certainly be used, in this paper we consider a relatively 
easy to compute function to determine packet value to 
applications that illustrates its usefulness in making rate 
adaptation, re-transmission, packet ordering, and net- 
work coding decisions based on the worth of packet con- 
tents. 

In our scheme, we assign a weight, X, to each video 
frame, based on how many bytes the frame can help de- 
code (including itself). This weight is given to all con- 
stituent packets of this frame. Our media-aware proxy 
knows the video encoding process, and can calculate 
X by buffering and observing packets corresponding to 
each GOP before making transmission decisions. We 
next assign a weight, C, proportional to the number of
intended recipients of each packet. Finally, we assign 
a third weight, D, based on the delay until the display 
deadline of this frame. We normalize all these weights to 
the same scale, and assign the value of the packet to be 
the product of these normalized weights, thus, 

$$\text{Value} = X \times C / D.$$
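
The following Python sketch illustrates one way the weight X and the resulting value could be computed for a small GOP. It is a sketch under stated assumptions: the dependency map, the max-based normalization, and the names `decodable_bytes`, `normalize` and `packet_value` are hypothetical, and D is treated here as the normalized time remaining until the display deadline, so that value grows as the deadline approaches.

```python
from typing import Dict, List, Set


def decodable_bytes(frame: str, size: Dict[str, int],
                    deps: Dict[str, Set[str]]) -> int:
    """X: bytes of every frame that needs `frame` for decoding, itself included."""
    helped = {frame}
    changed = True
    while changed:                      # transitive closure over dependencies
        changed = False
        for other, needs in deps.items():
            if other not in helped and needs & helped:
                helped.add(other)
                changed = True
    return sum(size[f] for f in helped)


def normalize(weights: List[float]) -> List[float]:
    """Scale weights into (0, 1] by dividing by the maximum (an assumed scheme)."""
    top = max(weights)
    return [w / top for w in weights]


def packet_value(x: float, c: float, d: float) -> float:
    """Value = X * C / D, applied to already-normalized weights."""
    return x * c / d


# Toy GOP: P1 depends on I0; B2 depends on both I0 and P1.
size = {"I0": 3000, "P1": 1500, "B2": 500}
deps = {"I0": set(), "P1": {"I0"}, "B2": {"I0", "P1"}}
x_raw = [decodable_bytes(f, size, deps) for f in ("I0", "P1", "B2")]
x_norm = normalize(x_raw)
```

In this toy GOP the I frame helps decode all 5000 bytes, the P frame 2000 bytes, and the B frame only its own 500 bytes, which matches the intuition that I-frame packets are the most valuable.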


2.2 Determining a base PHY rate 


Our PHY rate selection process is two-paced. Initially,
we compute a conservative PHY rate for all the packets. 
If channel capacity is abundant, all packets will simply 
be transmitted at this rate to enhance the possibility of 
successful reception. However, if the channel is error-prone,
then some of these rates will be updated, as described
in Section 2.3. The timescale for adapting the base
PHY rate depends on the reception report frequency of 
clients (roughly every 100 ms in our current implemen- 
tation). 

We pick the highest PHY rate such that the expected 
error probability of the packets at all clients would be 
below a certain threshold (set to a low value) as the 
base PHY rate. By retaining PHY rate information for 
all transmitted packets, the proxy, on receiving ACK
bitmaps from clients, can infer the necessary error rates.
In case statistics for certain PHY rates are missing (pos- 
sibly because no packets are sent at these rates), standard 
interpolation techniques can be used to estimate the ex- 
pected error rates. 
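
A minimal Python sketch of this base-rate selection follows. It assumes the 802.11g OFDM rate set, linear interpolation for rates with no statistics, and that a rate with no usable estimate is treated as unsafe; the names `interpolate` and `base_rate` and the 5% threshold used in the example are hypothetical.

```python
from typing import Dict, List, Optional

RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]  # 802.11g OFDM rates


def interpolate(err: Dict[int, float], rate: int) -> Optional[float]:
    """Estimate the error rate at `rate`, interpolating between measured rates."""
    if rate in err:
        return err[rate]
    lower = [r for r in err if r < rate]
    upper = [r for r in err if r > rate]
    if not lower or not upper:
        return None                    # cannot interpolate outside measurements
    lo, hi = max(lower), min(upper)
    frac = (rate - lo) / (hi - lo)
    return err[lo] + frac * (err[hi] - err[lo])


def base_rate(per_client_err: List[Dict[int, float]], threshold: float) -> int:
    """Highest rate whose expected error is below threshold at every client."""
    best = RATES_MBPS[0]               # fall back to the most robust rate
    for rate in RATES_MBPS:
        estimates = [interpolate(err, rate) for err in per_client_err]
        if all(e is not None and e < threshold for e in estimates):
            best = rate
    return best


# Two clients with partial statistics; the second has a weaker channel.
clients = [
    {6: 0.0, 24: 0.02, 54: 0.5},
    {6: 0.01, 12: 0.02, 36: 0.30},
]
chosen = base_rate(clients, threshold=0.05)
```

With these numbers the second client already exceeds the threshold at 18 Mbps (an interpolated 9% error), so the sketch settles on 12 Mbps even though the first client could sustain 24 Mbps.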

An important distinction of our broadcast rate assignment
from typical 802.11 unicast rate assignment algorithms
[8, 30] is that it does not favor a higher rate to
merely increase the channel utilization. Instead it tries to
ensure high reception probability across multiple broad- 
cast receivers (who might have diverse channel condi- 
tions). 

Another distinction of Medusa rate adaptation from
unicast stems from the inability of Medusa to adapt quickly
to changed network conditions, due to delayed ACKs.
This can result in Medusa persisting at a high data rate
even when channel quality has deteriorated, resulting in
high errors. To ensure that such a situation does not
occur, we update the error characteristic of the PHY
rates with an EWMA function with heavy weight on history
(thus preventing it from reacting to transient fades of
the channel).
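
The history-weighted update can be sketched in a few lines of Python. The weight of 0.9 on history is an assumed value chosen only for illustration, and `ewma_update` is a hypothetical name.

```python
def ewma_update(old: float, sample: float, history_weight: float = 0.9) -> float:
    """Blend a new error-rate sample into the running estimate.

    A history_weight close to 1 means a single bad report (a transient
    fade) moves the estimate only slightly; 0.9 is an assumed value.
    """
    return history_weight * old + (1.0 - history_weight) * sample


# One report with 50% loss barely moves a 2% estimate ...
after_fade = ewma_update(0.02, 0.5)

# ... while a sustained deterioration raises it steadily over many reports.
estimate = 0.02
for _ in range(10):
    estimate = ewma_update(estimate, 0.5)
```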

Transmitting packets at the base PHY rate ensures that
the packets have a good likelihood of successful reception.
However, the base rate of the system may be too low
(say, due to the presence of clients with bad channel quality),
which limits successful transmission of all video packets
before their respective deadlines. Under such circumstances,
we selectively increase the transmit rate of different
packets in a certain order as described next.


2.3 Packet order and actual PHY rate 


The problem of deciding the video packet schedule while
maximizing the delivered quality across a group of users
is known to be NP-complete [12]. Hence, we use a
heuristic algorithm for packet ordering. We now describe
our mechanism to determine the transmission order
and PHY rate of packets, using an algorithm we call
Inflate-Deflate. The heuristic schedules a batch of packets
in each round. For ease of exposition we assume the
presence of a virtual timeline, and our goal is to place
packets on this timeline and determine their PHY rate.
Placing a packet at a given timeslot signifies scheduling
the transmission of the packet at that instant. Packets
(including retransmission candidates) are added onto the
timeline in decreasing order of packet value (as calculated
in Section 2.1), i.e., higher valued packets are
placed first, and lower valued packets later.

Figure 2: Packet ordering and final rate assignment carried
out by Medusa.

We describe our algorithm for ordering packets next
and illustrate it with a toy example. We consider
A, B, C, D, E, F and G as the seven packets which need
to be transmitted (Figure 2). The width of a packet signifies
the time required to transmit the packet at its base
PHY rate. Initially we try to place each packet at its ideal
timeslot, a time by which it needs to be transmitted
so that it can be re-transmitted multiple times in case of
consecutive losses. Also, the latest timeslot at which a
packet can be placed is when it just makes its playback
deadline for the slowest client, called the deadline slot
for the packet. The reason for not placing a packet at
the earliest available slot is to ensure that packets which
have lower value but have an earlier deadline than the
more important packets still get a chance to be scheduled.
When the current time is past a packet’s deadline slot,
and it is not required for decoding any other packet at
any receiver, the packet can be discarded. We now walk
through the example of how different packets get placed
on the scheduling timeline using the following cases.

Case-I (Sufficient time is available to schedule all
packets at their ideal timeslots): This scenario is depicted in
Figure 2(a). Here, packets A, B, and C are scheduled at
their ideal timeslots. Also, the packets are assigned their
base rates.

Case-II (Ideal slot occupied by a higher valued
packet): Under realistic settings, contention from other
traffic sources and the necessity to retransmit different
packets would mean that scheduling all packets at their
ideal slot might not be possible. We depict such a situation
in Figure 2(b). In this case, scheduling packet D
at its ideal slot would lead to an overlap with packet C.
We mandate that the packet with the lower value be the
one which gets moved around. We find a best-fit timeslot
in the schedule for the lower valued packet D. Note that
considering packets in order of their value implies that
only the current packet needs to be moved.


Case-III (No slots left at current PHY rate): In extreme
cases, we might be unable to find a big enough time
slot for a packet. For example, in Figure 2(c) we are unable
to find a timeslot big enough to fit packet F’ at its
base rate before its deadline. In such a case, we increase
the data rate of the packet. We call this operation Deflate,
as it results in shortening the packet dimensions on the
transmission timeline. For example, we were able to fit
F on the timeline after the deflating operation. In this operation,
we keep trying to find a best-fit timeslot for the
packet by increasing its rate. This process continues till
we find a slot to fit the packet, or we exhaust all rates
without being able to fit the packet on the timeline. The latter
would imply the inability to schedule the packet in the
current round. We show this in Figure 2(d). Packet G
could not be placed in the timeline even at the highest
PHY rate and hence had to be left out of the current iteration.

Case-IV (Compensating for rate optimization):
Packets with only a few intended recipients (say, ones
with bad channel quality) would have lower value. Thus,
such packets would potentially be transmitted at higher
rates. This might lead to drastic degradation in received
video at clients with bad channel quality. To remedy this,
we carry out a round of rate re-assignment before sending
out the packets. We call this operation Inflate, as it
decreases the rate of some packets, thus increasing their
size on the transmission timeline. Note that the inflate
process does not increase the size of the timeline itself.
Inflating is carried out by going through the list of active
clients and calculating the expected distortion in video
quality they would suffer if the packets are sent out according
to the current plan. In case the expected quality of a
client falls below a certain minimum threshold, we find
the packet which can increase the expected quality of
reception the most and we decrease the rate of transmission
of that packet. We then try to compensate by deflating
a few other packets which would minimally decrease
the quality of video at clients. This process is illustrated
in Figure 2(e). In this example packet F’ is deemed a 
valuable packet for a client with poor channel quality and 
hence, its rate is reduced, while the transmission rates of 
D and F are increased as a compensation. 


Note that Inflate might lead to an overall reduction in
the quality of video received across all clients. We keep it in
order to ensure that a minimal quality of video is served
to each client.

We would like to note that once this order has been
determined, we do not delay the transmission of packets
until the scheduled timeslot. Instead the packets are
sent out at the next transmission opportunity. This ensures
that we get even more opportunities to retransmit
a packet before its deadline expires. The virtual timeline
and time slots are, thus, used only to determine the
order of transmissions and the corresponding rates.
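
The placement part of the heuristic (Cases I to III) can be sketched in Python as below. This is a simplified illustration, not the Medusa implementation: the Inflate pass of Case-IV is omitted, time is measured in microseconds, the rate set is assumed to be the 802.11g OFDM rates, and the names `Packet`, `free_gaps`, `place` and `schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]  # assumed rate set


@dataclass
class Packet:
    name: str
    size_bits: int
    value: float
    ideal_start: float   # microseconds on the virtual timeline
    deadline: float      # latest finish time (the deadline slot)


def free_gaps(busy: List[Tuple[float, float]], horizon: float):
    """Free (start, end) gaps on [0, horizon] given busy intervals."""
    gaps, cursor = [], 0.0
    for start, end in sorted(busy):
        if cursor >= horizon:
            break
        if start > cursor:
            gaps.append((cursor, min(start, horizon)))
        cursor = max(cursor, end)
    if cursor < horizon:
        gaps.append((cursor, horizon))
    return gaps


def place(pkt: Packet, rate: int, busy) -> Optional[float]:
    """Ideal slot if free (Case I), else the tightest fitting gap (Case II)."""
    duration = pkt.size_bits / rate          # bits / Mbps = microseconds
    fits = []
    for start, end in free_gaps(busy, pkt.deadline):
        if start <= pkt.ideal_start and pkt.ideal_start + duration <= end:
            return pkt.ideal_start
        if end - start >= duration:
            fits.append((end - start, start))
    return min(fits)[1] if fits else None


def schedule(packets: List[Packet], base_rate: int):
    """Place packets by decreasing value, deflating (Case III) when needed."""
    busy, plan, left_out = [], [], []
    for pkt in sorted(packets, key=lambda p: -p.value):
        for rate in (r for r in RATES_MBPS if r >= base_rate):
            start = place(pkt, rate, busy)
            if start is not None:
                busy.append((start, start + pkt.size_bits / rate))
                plan.append((start, pkt.name, rate))
                break
        else:
            left_out.append(pkt.name)        # no rate fits in this round
    return sorted(plan), left_out
```

Sorting the plan by start time yields only the transmission order and the per-packet rates; as noted above, the slot times themselves are not used to delay transmissions.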

An interesting aspect of the PHY rate selection using 
Inflate-Deflate is that many packets can get transmitted 
at distinct rates based on the rate assignment algorithm. 
As a consequence, the proxy can get feedback on a large 
range of PHY rates from the clients, without having to 
explicitly raise the base PHY rate. This is another difference
between the rate adaptation and error rate estimation
technique employed by Medusa and unicast rate
adaptation techniques such as SampleRate [8].


2.4 Re-transmission planning 


We discuss the three inter-related components of re- 
transmission planning next. 

Timeout estimation: A key issue in planning for re-transmissions
is to determine the timeouts accurately:
under-estimation would lead to redundant packet transmissions,
while over-estimation would lead to video packets
missing their playback deadline. Since each client
reception report acknowledges a block of packets, we
have to adjust the round trip time and the timeout calculations
to account for additional delays incurred at
clients. We adopt a TCP-like Exponentially Weighted
Moving Average (EWMA) mechanism for RTT estimation,
which takes into account this change. Furthermore,
we re-compute the value for all packets that become eligible
for re-transmission, and use this new value to determine
the packet transmission ordering and PHY rate.
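
A TCP-like estimator of this kind could look like the Python sketch below. The smoothing constants (1/8 and 1/4) and the "smoothed RTT plus four times the variation" timeout are borrowed from standard TCP practice as an assumption; the paper does not state which constants Medusa uses, and `TimeoutEstimator` is a hypothetical name. Each sample runs from the transmission of a packet to the arrival of the report that covers it, so the batching delay at the client is included.

```python
from typing import Optional


class TimeoutEstimator:
    """EWMA-based RTT and timeout estimate driven by block reception reports."""

    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha            # gain for the smoothed RTT (assumed)
        self.beta = beta              # gain for the RTT variation (assumed)
        self.srtt: Optional[float] = None
        self.rttvar = 0.0

    def observe(self, sent_at: float, report_at: float) -> None:
        """Feed one sample: packet send time and covering-report arrival time."""
        sample = report_at - sent_at
        if self.srtt is None:
            self.srtt, self.rttvar = sample, sample / 2
        else:
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - sample))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample

    def timeout(self) -> float:
        """Retransmission timeout: smoothed RTT plus four times its variation."""
        return self.srtt + 4 * self.rttvar


est = TimeoutEstimator()
est.observe(sent_at=0.0, report_at=0.1)   # report arrived 100 ms after sending
first_timeout = est.timeout()
```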

Network coded re-transmissions: As packet errors at
different locations occur independently, multiple clients
would potentially (not) receive different packets from a
set of consecutive transmissions. This allows us to deploy
a simple XOR-based coding [23] of packets to be
re-transmitted, to further optimize channel utilization. In
our system, we XOR-code a group of packets only if
they satisfy the following rule: Out of a set of packets to
be re-transmitted, if a subset of packets can be found such
that each intended recipient of a specific packet has received
all other packets in the subset, then the subset can
be network coded. Such coding opportunities occur frequently
in the proxy, as MAC-layer re-transmissions are
not used in Medusa. The algorithm for network coded
re-transmissions is shown in Algorithm 2.1.

Algorithm 2.1: NETCODE(P)

INPUT  P: set of coding candidates, arranged in decreasing order of packet values
       Coding_set: set of packets to be coded
       P_i.client_set: set of clients interested in P_i
OUTPUT S: set of coded packets

for each P_i ∈ P
  do Coding_set ← ∅
     for each P_j ∈ P, j > i
       do if is_coding_worthy(P_j, Coding_set) = true
            then Coding_set ← Coding_set ∪ {P_j}
     if Coding_set ≠ ∅
       then X ← make_coded_packet(P_i, Coding_set)
            X.rate ← P_i.rate
            S ← S ∪ {X}
       else S ← S ∪ {P_i}
return (S)

procedure IS_CODING_WORTHY(P_j, Coding_set)
  for each C_k ∈ Coding_set
    do if P_j.client_set ∩ C_k.client_set ≠ ∅
         then return (false)
  return (true)

A key decision in our design is to determine the set
of packets to be coded after the packet order has been
decided. This is done to keep the packet ordering algorithm
simple, as otherwise the algorithm would have to
deal with coded packets (with multiple constituent pack- 
ets of different values), while deciding the sending order 
and rate. A coded packet is always transmitted with the 
intended PHY rate of the first packet in the set. This en- 
sures that the probability of error in receiving the first 
packet at its intended receivers is not hampered, while 
opportunistically delivering other packets in the coded 
set to their respective clients. At the client side, all received
packets (natively or from network coded packets)
are maintained till their deadline expires. This
is done to ensure that packets coded with previously re- 
ceived packets can be recovered. The client sends back 
acknowledgment for packets which are successfully de- 
coded as part of reception reports. 
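
The coding rule stated above can be illustrated with a small Python sketch. It is a greedy variant written for illustration, not a transcription of Algorithm 2.1: it checks the rule directly against what each client is known to hold, marks packets as used once coded, and pads payloads with zero bytes. The names `Pkt`, `xor_bytes`, `codable` and `netcode` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Pkt:
    pid: int
    payload: bytes
    needs: Set[str]          # clients still missing this packet


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two payloads, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))


def codable(p: Pkt, group: List[Pkt], has: Dict[str, Set[int]]) -> bool:
    """True if every recipient of each packet already holds all the others."""
    for q in group:
        if any(q.pid not in has[c] for c in p.needs):
            return False
        if any(p.pid not in has[c] for c in q.needs):
            return False
    return True


def netcode(cands: List[Pkt],
            has: Dict[str, Set[int]]) -> List[Tuple[List[int], bytes]]:
    """Greedily group candidates (given in decreasing value) into XOR packets."""
    out, used = [], set()
    for i, head in enumerate(cands):
        if head.pid in used:
            continue
        group = [head]
        for p in cands[i + 1:]:
            if p.pid not in used and codable(p, group, has):
                group.append(p)
        used.update(p.pid for p in group)
        payload = b""
        for p in group:
            payload = xor_bytes(payload, p.payload)
        out.append(([p.pid for p in group], payload))
    return out
```

A client that holds every packet of a coded group except one recovers the missing packet by XOR-ing the coded payload with the packets it already has, which is why clients keep received packets until their deadlines expire.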

Delayed packet discard: The deadline for packet
delivery shifts over the duration of a streaming session.
We initially set it to the playback deadline of the frame,
which is calculated using the following formula:

$$\text{Deadline} = \frac{\text{Frame}_{\text{seqno}}}{\text{Frame Rate}} + \text{Playback buffer size} + \theta,$$

where Deadline is the number of seconds from the
transmission time of the first video frame, $\text{Frame}_{\text{seqno}}$
is the sequence number of the frame, Frame Rate is the
number of frames that the video player needs to display
in a second, Playback buffer size is the amount of time
(in seconds) that the receiver can store the video before
it needs to start decoding the frames, and $\theta$ is a small
time constant added to account for initial frame delay.

Once the playback deadline of a frame expires, we reset
the deadline for delivering its constituent packets to
that of the next frame which depends on the successful
reception of these packets for decoding. This goes on until
a packet is delivered to all clients, or the deadlines of
all the frames which depend on the packet have
expired. We drop the packet from our system at that instant.

Similarly, at the client, we discard a packet only if its playback
deadline has expired and the packet is not useful for
decoding any other frame.
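
The deadline formula and the delayed-discard rule can be sketched in Python as follows. The function names, the value used for the small initial-delay constant, and the representation of dependent frames as a list of their playback deadlines are assumptions made for illustration.

```python
from typing import List, Optional


def playback_deadline(frame_seqno: int, frame_rate: float,
                      buffer_s: float, initial_delay_s: float) -> float:
    """Seconds after the first frame's transmission by which the frame is due."""
    return frame_seqno / frame_rate + buffer_s + initial_delay_s


def delivery_deadline(now: float, own_deadline: float,
                      dependent_deadlines: List[float]) -> Optional[float]:
    """Current delivery deadline for a frame's packets, or None to discard.

    Starts at the frame's own playback deadline; once that has expired it
    moves to the earliest still-pending deadline among dependent frames.
    """
    if own_deadline > now:
        return own_deadline
    pending = [d for d in dependent_deadlines if d > now]
    return min(pending) if pending else None


# Frame 60 of a 30 fps video, 10 s playback buffer, 0.5 s assumed constant.
own = playback_deadline(60, 30, 10, 0.5)
```

A packet is kept alive for as long as `delivery_deadline` returns a value, which mirrors the rule that a packet is dropped only when neither its own frame nor any dependent frame can still use it.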


3 Putting it all together 


We have implemented the Medusa proxy and client. The 
implementation consists of about 3.5K lines of C code. 
We stream video using the Evalvid tools package [1]. 
We modified Evalvid to provide information about de- 
pendency structure of video frames, frame type of the 
generated packet and the deadline of the packet. The 
Medusa proxy runs as an application-level process. We
modified the MadWiFi driver to carry out per-packet rate
assignment. Per-packet rate assignment is achieved by
specifying the target rate in a header of the video packet
and then extracting it out of the packet inside the AP’s
driver.

At the client, the Medusa module keeps information 
regarding number of packets received and the channel 
quality. The module passes the received video pack- 
ets to video playback software such as VLC [7] and 
MPlayer [5], for displaying. It also keeps a copy of re- 
ceived packets, till the expiry of their deadline for decod- 
ing other packets. 


4 Evaluation 


To study the performance of Medusa we have experimented
with up to 25 users that are associated with a single
AP (operating in 802.11g mode) and attempting to
receive HD quality video from the Medusa proxy. Our
setup consists of 30 laptops with the Atheros wireless driver
running the Linux operating system.

Wireless conditions: The experiments were done on a
university building floor. We broadly classify our wireless
environment into three types: (i) Low-loss environment,
corresponding to specific client locations where
the packet error rates were 5% or less; (ii) Medium-loss
environment, corresponding to locations where packet
error rates were in the range of 5-15%; and (iii) High-loss
environment, corresponding to locations where packet
error rates were in excess of 15%. For the set of experiments
reported, an experiment location did not shift from
one type to another in the course of an experiment.


MOS rating of video quality | PSNR range
                            | 31-37
                            | 25-31
                            | 20-25

Table 1: Mapping of MOS-based user perception
of video quality to the PSNR range.


Video setup: We experimented with different video
clips; in this paper we present results for the Mobile Calendar
video clip [3], replayed back to back to run for 2
minutes. The video was encoded at rates of 5, 10, 15 and
20 Mbps using the FFmpeg [2] tool with the H.264 codec. We
repeated each experiment for 20 runs. For our experiments,
we used a fixed playback buffer of 10 seconds
at clients. We intend to evaluate the benefits of adaptively
modifying the playback buffer size in the future.

Metrics: We compare the performance of different 
schemes in terms of Peak Signal-to-Noise Ratio (PSNR), 
jitter, and overall network load imparted. 

PSNR: This is a standard metric for measuring the relative
quality of video streams [13,20]. The PSNR of a video is
well correlated with the perceived quality of video experienced
by the user. The relationship between user perception,
expressed as Mean Opinion Score (MOS), and the
PSNR range is detailed in [17,22] and is summarized
in Table 1.
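
For reference, the standard per-frame PSNR computation is sketched below in Python. It assumes 8-bit samples (peak value 255) passed as flat lists of equal length; the function name `psnr` is hypothetical, and the paper's results are produced by its own video evaluation tools rather than by this snippet.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], received: Sequence[int],
         peak: int = 255) -> float:
    """PSNR in dB: 10 * log10(peak^2 / MSE) over the samples of one frame."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, received)) / len(reference)
    if mse == 0:
        return math.inf               # identical frames
    return 10 * math.log10(peak ** 2 / mse)


# A received frame that differs by one grey level in every sample.
ref = [100, 120, 140, 160]
rec = [101, 121, 141, 161]
quality_db = psnr(ref, rec)
```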

Jitter: We measure the Instantaneous Packet Delay
Variation (IPDV) [14] of received packets as a measure
of jitter of the delivered video stream. This metric complements
the PSNR metric, which is oblivious to the delay
and jitter of the delivered video, as it assumes the
presence of an infinite playback buffer. A high jitter value
signifies poor performance.
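
IPDV is the difference between the one-way delays of consecutive packets, which the following Python sketch computes from send and receive timestamps. The timestamps are illustrative, and the names `ipdv` and `mean_abs_ipdv` are hypothetical; how the paper aggregates the per-packet values into the reported jitter is not specified here, so the mean of absolute values is only one possible summary.

```python
from typing import List, Sequence


def ipdv(send_times: Sequence[float], recv_times: Sequence[float]) -> List[float]:
    """Delay variation between each pair of consecutive received packets."""
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


def mean_abs_ipdv(send_times: Sequence[float], recv_times: Sequence[float]) -> float:
    """One possible jitter summary: mean magnitude of the IPDV samples."""
    samples = ipdv(send_times, recv_times)
    return sum(abs(v) for v in samples) / len(samples)


# Timestamps in milliseconds; the third packet is delayed by an extra 4 ms.
sent = [0, 10, 20, 30]
received = [5, 15, 29, 35]
variation = ipdv(sent, received)
```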

Network load: We measure the load placed on the
network by different schemes in terms of (a) the number
of packets transmitted over the air and (b) the
amount of airtime occupied by the packets sent by different
schemes.

Compared schemes: We compare the performance of 
Medusa to the following alternate schemes. 

BDCST: This scheme uses WiFi broadcast to transmit 
packets. However, unlike normal WiFi broadcast, the 
PHY rate is chosen to maximize the video PSNR per- 
formance averaged across all clients. The PHY rate is 
selected by sending about 30 seconds of traffic at differ- 
ent rates. 

UCAST-INDIV: In this scheme we send the video
stream to each client using isolated WiFi unicast, in sequence.
For example, if there are two clients, we first
send the entire video to client 1 and then the same video
separately to client 2 using WiFi unicast. This scheme
gives a quality bound for Medusa.

Figure 3: Plot showing the average per-client PSNR of
different schemes when serving 25 clients with a 20 Mbps
video stream, under medium loss conditions. The mean
and the variance (error bars) are shown.

UCAST-SIMUL: In this scheme we send the video traf- 
fic to all the clients simultaneously using normal WiFi 
unicast with SampleRate rate adaptation. This is the tra- 
ditional method for wireless data delivery. 

Note that in all of these alternate schemes described 
above, there is no proxy and the APs and clients are 
unmodified, i.e., the APs take PHY rate adaptation and 
packet re-transmission decisions, while clients do not 
need to send out reception reports. 

To evaluate Medusa we first look at the overall system 
performance in the multicast (Section 4.1) and the uni- 
cast (Section 4.2) cases. We then look at contribution of 
various Medusa components to the overall performance 
in Section 4.3. Specifically, we investigate benefits of 
rate adaptation in Section 4.3.1 and the performance ben- 
efits due to retransmissions in Section 4.3.2. 


4.1 Overall performance (multicast) 


We begin by evaluating the performance of the Medusa
system in terms of its scalability for multicast traffic scenarios.
We do so by increasing the number of clients and
by increasing video rates.

Scalability in the number of clients: We compare 
how different schemes can support HD video delivery 
to a large number of co-located WiFi clients. Figure 3 
shows the performance of a highly loaded system with 
25 clients (all receiving the same 20 Mbps video stream) 
at medium loss locations. We find that Medusa performs 
close to UCAST-INDIV (difference of 3-4 dB with 25 
clients) with increasing client count, and is significantly 
superior to all other schemes. 

Also, we find that there is a graceful degradation in
Medusa performance when the number of clients is increased
from 1 to 25. But even with 25 clients, the average
PSNR value is around 37 dB while BDCST performance
is around 27 dB (a 10 dB difference). The gradual degradation
in performance of BDCST is because of the almost
similar nature of errors experienced at each client (10-15%
packet error).

The performance of UCAST-SIMUL suffers as
802.11a/g technology cannot support more than 2
streams at a 20 Mbps rate (20 + 20 = 40 Mbps net load).

Scalability in video rate: We fix the number of clients
to 10 and evaluate how the performance scales with increasing
video rate, from 1 Mbps to 20 Mbps. We
show the results separately for clients in good, medium,
and bad channel conditions in Figures 4(a), (b) and (c),
respectively.

For good channel conditions we observe that UCAST-SIMUL
quickly degrades in performance with increase in
video rates. Even at 5 Mbps (where the aggregate load is
expected to be 5 Mbps × 10 = 50 Mbps), a lot of packet
losses and buffer underflows occur. BDCST performs
better and provides a more gradual performance degradation
across the different rates. However, Medusa outperforms
both and performs identically to UCAST-INDIV.

With worsening channel conditions, as shown in Figure 4(c),
the performance of all schemes suffered. An interesting
observation is that the performance of UCAST-INDIV
became worse than Medusa as the traffic load
(video rate) increased above 15 Mbps. This is due
to “head-of-line” blocking in AP wireless NICs in the
UCAST-INDIV case. Essentially, when various P- or B-packets
encountered losses, the AP spent significant
effort in re-transmitting these packets, while more important
I-packets waited behind. The lack of knowledge
about the value of different packets prevented the AP
from devoting an appropriate amount of re-transmission
effort to more important packets. Medusa explicitly addresses
this problem and hence led to improved performance.


4.1.1 Jitter variation of Medusa 


We present the results for jitter (measured as IPDV) in
Figure 5. The experiment involved 10 users. Jitter increases
with an increase in the number of clients, for
all the schemes. However, the jitter of Medusa is significantly
lower than both BDCST and UCAST-SIMUL.
The jitter of UCAST-SIMUL increases exponentially with
the number of clients. This can be attributed to the fact
that with an increasing number of clients the amount of data
that must be transmitted exceeds the network
capacity. This results in a cascade of video packet
drops in AP buffers and missed deadlines. The jitter
for BDCST also grows with the number of clients, as
the number of candidates who can lose packets has also
increased. Also, we note that the slope of increasing jitter
for Medusa is lower than that of BDCST, signifying a
more gradual increase.

Figure 4: Average PSNR for 10 clients averaged over 20 runs as a function of the video rate under varying channel

Figure 5: Average jitter experienced by 10 clients under
medium channel conditions for 20 runs.


4.1.2 Induced network load 


Apart from providing video quality commensurate with
each user’s channel quality, a good video multicast system
should induce minimal additional network load. We
compare the network load imparted by BDCST, UCAST-INDIV
and Medusa in Table 2. We calculate the additional
load placed in terms of the amount of airtime occupied
by the packets (packet size divided by data rate)
which were transmitted using the different schemes. The
results are normalized by the amount of airtime taken by
BDCST. Table 2 shows that Medusa has an overhead of
4% for good channel conditions, which goes up to 30%
under bad channel conditions. This overhead is to compensate
for the 1-5% of errors that occur in good channel
conditions. The channel-induced losses go up to 15-25%
when the channel conditions are bad in our settings,
forcing Medusa to inject an extra 30% traffic into the network.
Hence, Medusa does not place unnecessarily high
traffic load on the network.


Channel cond. | BDCST | UCAST-INDIV | Medusa
Good          | 1     | 10.12       |
Medium        | 1     | 10.26       |
Bad           | 1     | 11.40       |

Table 2: Airtime occupied by different schemes, normalized
to that of BDCST, for 10 clients watching a 5 Mbps
video, averaged over 20 runs, under varying channel conditions.


The above observation would seem to contradict
the fact that Medusa uses a conservative rate-adaptation
mechanism, which should significantly increase its network
resource usage. However, we find that conservative
rate adaptation, while increasing the relative time occupied
by individual packets, also suffers less packet loss, thus
keeping the overall network utilization low. We present
further results in support of this statement in Section 4.3.


4.1.3 Interaction with other traffic 


We investigate the performance of Medusa in the presence
of multiple uncorrelated traffic sources in this section.
Since we do not introduce any new end-to-end congestion
control mechanism in Medusa, we do not present
in-depth results on the interaction of Medusa with TCP
flows. In our experiments, introducing a Medusa flow
without congestion control along with multiple TCP
flows results in the Medusa flow forcing the TCP flows to
share only the residual bandwidth amongst themselves.
We plan to implement a congestion-controlled version of
Medusa as part of our future work. We depict the impact
of UDP flows on Medusa performance in Figure 6.
We vary the number of background UDP flows, each
at 4 Mbps, and compare the behavior of Medusa operating
with a 10 Mbps video for 10 clients. The presence
of multiple UDP streams causes a reduction in the quality
of video seen at the clients for all schemes. However,
Medusa outperforms UCAST-INDIV as the number
of background flows is increased (around 7 dB better for
4 background flows). We find that these gains of Medusa
are mainly due to our intelligent packet (re)-transmission
ordering that mitigates the head-of-line blocking problem
in UCAST-INDIV. We show this explicitly by also
introducing a new scheme, called Medusa-noORDER,
in which the packet re-ordering mechanism of Medusa
is disabled. The performance of Medusa-noORDER is
quite similar to that of UCAST-INDIV.

Figure 6: Average PSNR for 10 clients averaged over
20 runs in presence of background UDP flows under
medium channel conditions. Video rate is 10 Mbps.


4.2 Overall performance (unicast) 


To evaluate the performance of Medusa in serving multiple
unicast video streams, we increased the number
of video flows from one to four in increments of
one. We selected the clients randomly from a pool of 15
clients. Each experiment was run 20 times with a 5 Mbps
video stream. Other uncorrelated background flows
(totaling 5 Mbps) were running during each experiment.
We plot the results in Figure 7. In
unicast traffic settings, the gains in Medusa come from
content-dependent rate selection, intelligent packet
ordering and re-transmissions, as well as network coded
re-transmissions. The broadcast advantage is available 
only for these network-coded re-transmissions, and not 
for original packets. With unicast video destined to 4 
clients, the aggregate load is 20 Mbps, which is quite 
significant. Under good channel conditions, Medusa still 
delivers an average PSNR of 40 dB, which is very sim- 
ilar to UCAST-INDIV and is 9 dB greater than UCAST- 
SIMUL. This gain is even larger (18 dB) under bad chan- 
nel conditions. 


4.3 Micro-benchmarks of Medusa components


We now evaluate the effect of individual design choices 
on overall system performance. We look into the
performance of rate adaptation under diverse channel
conditions in Section 4.3.1. The performance of network
coded retransmissions is evaluated in Section 4.3.2 and
the overall contribution of different components is
summarized in Section 4.3.3.


4.3.1 Rate adaptation in Medusa 


An important aspect of Medusa is its ability to adapt the 
PHY rate based on channel conditions. We investigate 
the performance of these mechanisms next. 

Impact of conservative base PHY rate adaptation: We
look at the effects of using a conservative rate adaptation
algorithm in Medusa on the overall system performance.
We conduct experiments with a 5 Mbps video rate to 10
clients in good, medium and bad channel conditions. We
ran Medusa with a conservative (err_thresh = 0.02)
and an aggressive (err_thresh = 0.18) rate adaptation
algorithm. Here, err_thresh signifies the maximum
expected error rate that we are willing to tolerate for
any PHY rate. We also ran the experiment with
UCAST-INDIV. All the experiments were repeated for 20 runs.
Figure 8(a, b) shows the CDF of PHY rates assigned by
different schemes under the good and the bad channel
conditions. We observe that Medusa-conservative assigns lower
PHY rates to packets than unicast, while the aggressive
algorithm assigns data rates higher than the conservative
scheme, but lower than the unicast scheme. To highlight
the benefits of conservative rate adaptation, we plot the
number of extra bytes transmitted, as a fraction of the
overall video size, and the PSNR of the resulting video under
different channel conditions when using conservative,
aggressive and unicast rate adaptation in Figure 8(c). For
UCAST-INDIV we plot the number of packets averaged
over the number of clients present. The following
observations can be made from the plot:


• Under good channel conditions, an aggressive as
well as a conservative scheme would lead to a similar
number of packet losses (1%). Under such circumstances,
all three schemes offer similar video quality
and send a similar amount of traffic over the network.
From Figure 8(a), we find that around 80% (74%) of
packets were transmitted at 24 Mbps or a higher rate
in UCAST-INDIV (Medusa-aggressive), in contrast
to only 30% for Medusa-conservative. Hence, using
an aggressive rate adaptation would have been
beneficial in this case, as it would lead to network
bandwidth conservation.


Figure 7: Overall performance of Medusa for unicast-only media traffic. Up to 4 clients shown, each requesting a
separate media stream with 5 Mbps video rate, under different channel conditions.

Figure 8: CDF of packet rates assigned when transmitting 5 Mbps video to 10 clients in different channel conditions
using different rate adaptation mechanisms. The bars in plot (c) show the performance of the schemes in terms of PSNR.
The numbers in plot (c), on top of each bar, depict the normalized extra traffic in number of bytes sent by each scheme,
relative to BDCST.


• For medium and bad loss environments, Medusa-conservative
sends around 20% and 10% of its packets
at 24 Mbps or higher. UCAST-INDIV sends about
40% and 11% of its packets at 24 Mbps or higher. In
contrast, Medusa-aggressive sends about 60% and
20% of its packets at 24 Mbps or higher. This is
because of the slowness of the feedback process, which
makes the Medusa-aggressive algorithm slow to
react to changes in channel conditions. The
performance of the Medusa-aggressive scheme suffers
because of its inability to adapt quickly, as shown in
Figure 8(c). The number of packets transmitted by
the conservative algorithm is around 15% less than
Medusa-aggressive. This is expected, as the strict
threshold value (err_thresh = 0.02) ensures that we
would make very few errors. The aggressive algorithm
also leads to worse video quality (in PSNR) when
network resources are scarce, precisely because of its
inefficient network resource usage. Worsening channel
conditions make the difference in video quality
about 6-12 dB (Medusa-conservative and UCAST-INDIV
have an advantage over Medusa-aggressive).


Thus, except under good channel conditions, keeping a 
conservative rate leads to better network resource utiliza- 
tion, while the quality is maximized in all conditions by 
adopting a conservative rate adaptation. 
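
The err_thresh rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: it supposes that the sender keeps a per-rate estimate of the expected error rate and picks the highest PHY rate whose estimate stays within err_thresh. The function name, the rate list and the example estimates are hypothetical and are not taken from the Medusa implementation.

```python
# Hypothetical sketch: choose a PHY rate under an error-rate threshold.
# 802.11a/g OFDM rates in Mbps.
PHY_RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]


def pick_phy_rate(est_error, err_thresh):
    """Return the highest rate whose estimated error rate is <= err_thresh.

    est_error maps a PHY rate (Mbps) to its estimated error rate in [0, 1].
    Rates without an estimate are treated as unusable (error rate 1.0).
    Falls back to the lowest rate when no rate meets the threshold.
    """
    usable = [r for r in PHY_RATES_MBPS if est_error.get(r, 1.0) <= err_thresh]
    return max(usable) if usable else min(PHY_RATES_MBPS)


# Illustrative (made-up) per-rate error estimates for one client.
estimates = {6: 0.00, 12: 0.01, 24: 0.05, 36: 0.15, 54: 0.40}

conservative_rate = pick_phy_rate(estimates, err_thresh=0.02)
aggressive_rate = pick_phy_rate(estimates, err_thresh=0.18)
print(conservative_rate, aggressive_rate)
```

With these made-up estimates, the conservative threshold settles on 12 Mbps and the aggressive one on 36 Mbps, which mirrors the qualitative gap between the two CDFs discussed above.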

Impact of mobility on rate adaptation: To study the
effect of mobility and its impact on adaptation mechanisms,
we performed targeted mobility experiments, where we
repeatedly moved one user between a high-loss and a
low-loss location, while all the other clients stayed
stationary at the low-loss location. The mobile client moved
from the low-loss to the high-loss location (across a wall)
quickly, stayed there for about 4 seconds, and returned.
We show the adaptation performance of Medusa in
comparison to UCAST-INDIV and BDCST in Figure 9. The
UCAST-INDIV scheme, running its MAC-layer rate
adaptation technique, adapts the fastest. Medusa, with its
intent of making rate adaptation decisions (of its PHY base
rate) slowly, adapts somewhat slower. It takes Medusa
about 0.4 seconds to adapt to the change in channel
condition for the mobile user. This occurs in both cases:
when the user moves away from the low-loss location,
and when it returns to the low-loss location. This can be
attributed to the higher-layer reception reports and the slower
timescales at which they occur. However, once Medusa
adapts, it provides the user with the same performance as
UCAST-INDIV in this case. The BDCST scheme has
no adaptation mechanism and does not adapt when the
user moves.

Figure 9: Adaptation of different schemes with targeted
mobility, for video at 20 Mbps rate.

Impact of interference on rate adaptation: We next
study the performance of these schemes under targeted
interference from an external 802.11 source that was
a hidden terminal to the Medusa clients. Figure 10(a)
shows the relative performance of Medusa, UCAST-INDIV,
and BDCST when the video rate was 5 Mbps.
The interferer used UDP to download a large file starting
at time 2 seconds. The performance impact of this
interference is similar to that of mobility. Medusa performed
similarly to UCAST-INDIV and much better than BDCST.
However, it experienced a slight delay in adapting its rate
when compared to UCAST-INDIV.

As shown in Figure 10(b), at a video rate of 20 Mbps,
a similar effect occurs with the hidden terminal
interference. However, the hidden terminal has a significantly
greater interference impact at this high video rate, and the
PSNR of both Medusa and UCAST-INDIV drops. Further,
at 20 Mbps and with hidden terminal interference,
the performance of UCAST-INDIV falls slightly below
that of Medusa. Examining the performance of Medusa more
closely, we see that at time 2.4 seconds, the inflate-deflate
algorithm kicks in to help improve performance. The
table in Figure 10(c) shows the number of I, P, and B
frames that Medusa had to discard, inflate, and deflate, and
their channel occupancy time in the three phases (initial
no interference, interference starts, and inflate-deflate
starts).


4.3.2 Network coded re-transmissions 


We evaluate the benefits of using network coded
re-transmissions with a varying number of clients in the
system (Figure 11). Panel (a) shows the percentage
of all packet transmissions in each case that were
actually network coded. As can be seen from the plot,
the number of network coding opportunities increases
with the number of clients. We also note
that the computation overhead is never more than 1% of
CPU time in any of our experiments. The actual
performance gains from network coding can be seen in
Figure 11(b), which indicates the reduction in airtime load
that occurred due to network coding opportunities.

Figure 11: Coding opportunities and percentage traffic
reduction as a function of number of clients under
medium channel condition with 5 Mbps video averaged
over 20 runs.

Channel cond. | Good | Medium | Bad

Table 3: Coding opportunities and normalized traffic
injected as a function of channel condition for five clients
with 5 Mbps video averaged over 20 runs.

Finally, we evaluate the benefits of network coded
re-transmissions under varying channel conditions. We
experiment with 5 clients receiving a 5 Mbps video under
good, medium, and bad channel conditions, respectively.
We report the coding opportunities and the airtime
reduction due to the network coded scheme in Table 3. The
table shows that worsening channel conditions lead to
higher benefits from network coding. This is expected:
the number of packet losses increases as the channel
condition becomes bad, which in turn leads to a higher
number of retransmissions and thus more coding
opportunities.
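
To make the link between losses and coding opportunities concrete, here is a minimal sketch. It assumes simple XOR coding, where packets lost by different clients are combined so that every client is missing at most one packet in each coded transmission; this section does not spell out the coding scheme, so the greedy grouping and all names below are hypothetical.

```python
from functools import reduce


def plan_coded_retransmissions(lost):
    """Greedily group lost packet ids into coded retransmissions.

    lost maps a client name to the set of packet ids it failed to receive.
    A packet may join a group only if no client is missing both that packet
    and a packet already in the group, so each client can decode the XOR.
    """
    all_lost = sorted(set().union(*lost.values()))
    groups = []
    for pkt in all_lost:
        for group in groups:
            conflict = any(
                pkt in missing and any(p in missing for p in group)
                for missing in lost.values()
            )
            if not conflict:
                group.append(pkt)
                break
        else:
            groups.append([pkt])
    return groups


def xor_payloads(payloads):
    """XOR equal-length payloads byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*payloads))


# Made-up loss map: three distinct packets were lost across three clients.
loss_map = {"A": {1}, "B": {2}, "C": {1, 3}}
plan = plan_coded_retransmissions(loss_map)

# Client A holds packet 2, so it recovers packet 1 from the coded packet.
p1, p2 = b"\x01\x02", b"\x10\x20"
coded = xor_payloads([p1, p2])
recovered = xor_payloads([coded, p2])
```

Here three lost packets are covered by two transmissions (packets 1 and 2 are sent as one coded packet, packet 3 alone). With more losses spread across clients, more such groupings become possible, which is the trend reported in Table 3.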

We note that using network coding also improves
PSNR with an increasing number of clients or
worsening channel error conditions. We do not present
the results for the sake of brevity.


4.3.3 Component contribution 

The Medusa system employs content-aware rate
adaptation, selective retransmissions and transmission
(re)ordering to provide quality enhancements over
broadcast-based media delivery. Figure 12 shows the
relative contribution of different design components in
Medusa, over and above standard WiFi broadcast. In the
low-loss and the high-loss environments, the various
mechanisms in Medusa (re-transmissions, rate adaptation,
ordering, and the rest, which comes from integration of all
the components) provide nearly 9 dB and 10 dB
improvement in PSNR, respectively, over plain BDCST.


5 Related Work 


There has been a significant amount of research in the
area of video streaming over wireless networks, in both the
video and systems communities (see [29] for a summary).
We comment on the most related pieces in this section.
Dynamic transcoding is a standard technique for
enhancing the quality of streaming video. It involves
estimating the bandwidth available in the medium and
then changing the video rate itself to ensure that the best
quality video that the channel can support is delivered to the
receivers. Chou et al. [9, 11, 12], in their seminal work,
propose a rate-distortion optimization technique to adjust
the rate of the transmitted video based on channel
quality. However, this body of work depends on the
wireless hardware to pick the rate at which the video is to
be transmitted. A second set of prior work, dealing with
identifying the optimal video rate as well as the amount
of redundancy to be added to the video stream, is
represented by [25, 26]. The authors formulate a complex
optimization problem for the same and provide heuristic
algorithms which show the performance benefits of
the designed algorithms. Such FEC-based mechanisms
are orthogonal to the set of techniques used in our work.
In Medusa we mostly leverage understanding from such
prior work, and tailor our solutions to the needs of
WiFi-based media delivery and specific issues therein.

Figure 10: Adaptation of different schemes with external hidden terminal interference with video at 5 and 20 Mbps
rate. The table shows the number of I, P, and B frames that were discarded, inflated, deflated, and the channel
occupancy time.

Figure 12: Performance breakdown between rate adaptation
and retransmission components of Medusa system
for 10 clients averaged over 10 runs under varying
channel conditions.


The authors of [19] use a scalable video codec and
optimally determine the amount of FEC required. This is
representative of a large body of literature in the area.
This approach is, however, complementary to ours, as we
focus on rate adaptation and re-transmission based
techniques for WiFi broadcasts in video delivery systems.


In general wide-area network settings, the OxygenTV 
project [15,16] has considered performing selective end- 
to-end re-transmissions of packets based on the video 
frame type, focusing more on unicast video delivery. 
They propose the SR-RTP protocol for such selective
retransmissions [16]. In contrast, our work explores
various wireless link adaptation mechanisms that lever- 
age packet content information. 


In [31], the authors present a measurement study of different
application-layer video streaming mechanisms in a
multi-hop wireless context. They do not explore interactions
between the value of content to applications and link
adaptation mechanisms, as we do in this work. In [27],
the authors present mechanisms to improve the quality of the
video while operating in a lossy wireless environment.
However, they focus on low bit-rate video streams, while
our solutions are tailored to deliver HD quality video in
WiFi environments.


The authors of [28] present an end-to-end video rate 
control protocol for mobile media streaming on Internet 
paths involving wireless links. They implement the control
functionalities in the receiver, which is charged with
providing feedback to the server. The video server uses this
information to change the video codecs used to match 
the available capacity of the end to end path. The pro- 
posed approach is complementary to ours, as we focus 
on adapting video delivery on the WiFi link, by making 
link adaptation decisions for WiFi transmitters. 


A recent mechanism, SoftCast [17], uses the notion of
compressed sensing to create equal-priority video packets.
This allows users to extract information proportional
to their own channel quality. The core of this work
focuses on the complementary aspect of compressed
sensing. Furthermore, SoftCast also requires changes to the
wireless radio hardware (and the PHY layer), while our
system makes no changes to the current 802.11
standards.

Finally, the authors of DirCast [10] also design and implement a
system for WiFi multicast. They advocate the use of
pseudo-broadcasts in their system. However, the main
difference between our Medusa approach and DirCast
is that we propose content-dependent PHY rate
selection, re-transmissions, and packet order selection. This
is an issue that is not considered by DirCast. DirCast
focuses on some complementary problems for the
multicast case only (e.g., intelligent client-AP association
decisions, FECs, etc.), and is agnostic of the value of packets
to applications.

We believe that the main contribution of Medusa is in 
combining some understanding of packet contents with 
various WiFi link layer functions to improve the qual- 
ity of media delivery. WiFi link layer decisions, until 
now, have been considered in a mostly content-agnostic 
manner. Medusa suggests an interesting design point for 
combining application-layer information in making de- 
cisions at the link layer. 

Various other new techniques can be brought in to
improve the performance of Medusa even further. In the future,
we therefore plan to investigate the use of other
complementary but related mechanisms, such as
application-layer FEC for proactive error recovery, and a congestion
control mechanism for co-existence with TCP flows.


6 Conclusions 


Media delivery over wireless systems is a growing area 
of importance. We present the design and implementa- 
tion of the Medusa system which allows efficient deliv- 
ery of high quality media to one or more WiFi clients. 
The key contribution of this work is in recognizing that 
certain link layer functions, e.g., re-transmissions, PHY
rate selection, packet transmission order, can be
implemented better by having some knowledge about the value
of packets to applications. In order to be minimally
invasive to existing systems, we implement this function in
a proxy. Our results indicate that our collection of tech- 
niques can facilitate HD video delivery of 20 Mbps to 25 
clients while maintaining a good viewing quality.


Abstract


Partial packet recovery protocols attempt to repair cor- 
rupted packets instead of retransmitting them in their en- 
tirety. Recent approaches have used physical layer con- 
fidence estimates or additional error detection codes em- 
bedded in each transmission to identify corrupt bits, or 
have applied forward error correction to repair without 
such explicit knowledge. In contrast to these approaches, 
our goal is a practical design that simultaneously: (a) re- 
quires no extra bits in correct packets, (b) reduces recov- 
ery latency, except in rare instances, (c) remains compat- 
ible with existing 802.11 devices by obeying timing and 
backoff standards, and (d) can be incrementally deployed 
on widely available access points and wireless cards. 

In this paper, we design, implement, and evaluate 
Maranello, a novel partial packet recovery mechanism 
for 802.11. In Maranello, the receiver computes check- 
sums over blocks in corrupt packets and bundles these 
checksums into a negative acknowledgment sent when 
the sender expects to receive an acknowledgment. The 
sender then retransmits only those blocks for which the 
checksum is incorrect, and repeats this partial retrans- 
mission until it receives an acknowledgment. Successful 
transmissions are not burdened by additional bits and the 
receiver need not infer which bits were corrupted. We
implemented Maranello using OpenFWWF (open source
firmware for Broadcom wireless cards) and deployed it 
in a small testbed. We compare Maranello to alterna- 
tive recovery protocols using a trace-driven simulation 
and to 802.11 using a live implementation under various 
channel conditions. To our knowledge, Maranello is the 
first partial packet recovery design to be implemented in 
commonly available firmware. 


1 Introduction 


Partial packet recovery approaches attempt to repair cor- 
rupt packets instead of retransmitting them. Packet re- 
covery relies on the observation that packets with errors 
may have only a few, localized errors, or at least some 
salvageable, correct content. Various approaches have 
been proposed: some rely on physical layer informa- 
tion to identify likely corrupt symbols (related groups 
of bits) to be retransmitted [12], while others embed
block checksums into oversized frames to allow the
receiver to recognize partially correct transmissions [11].
Some avoid explicit knowledge and adaptively transmit 
forward error correction information that is likely to be 
sufficient to repair bit errors [14]. These approaches 
have found substantial potential in partial packet recov- 
ery, particularly when auto-rate selection mechanisms, 
which dynamically change the transmission rate to max- 
imize throughput without too many errors, may choose 
too high a rate, thus creating errored packets to be recov- 
ered. 


Motivated by the potential of these recent approaches, 
we set out to construct a partial packet recovery scheme 
using commonly available 802.11 hardware and evalu- 
ate it in live networks. The key challenge in working 
within 802.11’s typical operation is timing, in particu- 
lar, performing all acknowledgment-related computation 
within one short inter-frame space (SIFS) interval (10 µs
for 802.11b/g or 16 µs for 802.11a). This requirement all
but precludes bus transfers to the driver and complex pro- 
cessing on the network devices. To be deployable today, 
partial packet recovery must exploit features available to 
programmable firmware. 


In this paper, we present Maranello, a block-based 
partial packet recovery approach implemented (primar- 
ily) in firmware for widely-available Broadcom cards. 
Maranello makes the following design decisions. We use
block-based recovery, meaning that we identify incor- 
rect blocks of consecutive bytes for retransmission, as 
opposed to aggregating by symbol or estimating bit error 
rate. We transmit independent repair packets that contain 
only the blocks being retransmitted, in contrast to other 
approaches that may bundle repair information with sub- 
sequent transmissions to save on medium acquisition 
time. Repair packets, by being shorter, are more likely 
to arrive successfully than full size retransmissions and 
take less time to transmit, improving performance over 
802.11. Using immediate repair packets also limits the 
amount of buffering (of out of order, incomplete pack- 
ets) required at the receiver side. We use the Fletcher-32 
checksum [5] to isolate errors to individual blocks; this 
checksum is sufficient to find all single bit errors, burst 
errors in a single 16-bit block, and two-bit errors
separated by at most 16 bits [25]. Fletcher-32 is also efficient
enough to be computed block-by-block in software
during frame reception. Finally, we exploit the deference
stations give to acknowledgments of overheard packets: 
because stations sending acknowledgments have priority 
over the medium right after a transmission, there is time 
for a receiver to grab the medium and send prompt feed- 
back about received blocks. Through these decisions, we 
construct a partial packet recovery scheme that (a) intro- 
duces no additional bits in the common case of success- 
ful transmissions, (b) decreases recovery time after failed 
transmissions, (c) is compatible with unmodified 802.11 
devices, and (d) can be implemented on typical off-the- 
shelf hardware and deployed incrementally. 
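
The block-based recovery loop described above can be sketched as follows. This is a simplified illustration, not the firmware implementation: the 64-byte block size, the representation of the negative acknowledgment as a plain list of checksums, and all function names are assumptions made here, and the sketch ignores the SIFS timing constraint entirely. The Fletcher-32 routine follows the common formulation over little-endian 16-bit words.

```python
BLOCK_SIZE = 64  # bytes per block; hypothetical value chosen for illustration


def fletcher32(data: bytes) -> int:
    """Fletcher-32 over little-endian 16-bit words (odd input is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1


def block_checksums(frame: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Receiver side: one checksum per block of a corrupt frame (NACK payload)."""
    return [fletcher32(frame[i:i + block_size])
            for i in range(0, len(frame), block_size)]


def blocks_to_retransmit(sent: bytes, nack: list, block_size: int = BLOCK_SIZE) -> list:
    """Sender side: indices of blocks whose checksum in the NACK does not match."""
    own = block_checksums(sent, block_size)
    return [i for i, (mine, theirs) in enumerate(zip(own, nack)) if mine != theirs]


def apply_repair(received: bytes, repair: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Receiver side: patch retransmitted blocks (index -> bytes) into the frame."""
    frame = bytearray(received)
    for index, block in repair.items():
        start = index * block_size
        frame[start:start + len(block)] = block
    return bytes(frame)


# Example: corrupt two bytes of a 256-byte frame and repair only those blocks.
sent = bytes(range(256))
corrupted = bytearray(sent)
corrupted[70] ^= 0xFF
corrupted[200] ^= 0xFF
received = bytes(corrupted)

bad = blocks_to_retransmit(sent, block_checksums(received))
repair = {i: sent[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in bad}
repaired = apply_repair(received, repair)
```

In this example only two of the four blocks are resent, so the repair packet carries half the bytes of a full retransmission, and a correctly received frame would have triggered no extra bits at all.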


Our goal in constructing a practical partial packet re- 
covery scheme was to permit evaluation both in simu- 
lation and on live networks. We apply two strategies. 
First, we construct a trace-driven simulation to evaluate 
the performance Maranello would have when run with 
various combinations of operating system, driver, and 
chipset, as well as the performance Maranello would 
have compared to idealized PPR [12] and ZipTx [14]. 
We study the retransmission behavior of 802.11 im- 
plementations so that we might simulate Maranello on 
each: performance improvement depends on how ag- 
gressively the existing firmware retransmits, in particu- 
lar, whether it performs proper exponential backoff and 
how it reduces transmission rate. We survey retrans- 
mission rate fallback selection schemes and show that 
Maranello increases throughput regardless of retrans- 
mission rate fallback: if the rate chosen is too high, 
Maranello may increase the delivery probability with a 
short repair packet [8]; if too low, Maranello decreases 
the time to transmit relative to retransmission. 
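
The two cases above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes independent bit errors at a fixed bit error rate and counts payload airtime only (no preamble, SIFS, or acknowledgment); both are simplifications introduced here, and the 64-byte repair size and the error rate are made-up values.

```python
def delivery_probability(n_bytes: int, ber: float) -> float:
    """Probability that every bit of an n_bytes frame arrives intact,
    assuming independent bit errors with bit error rate ber."""
    return (1.0 - ber) ** (8 * n_bytes)


def payload_airtime_us(n_bytes: int, rate_mbps: float) -> float:
    """Microseconds needed to send n_bytes of payload at rate_mbps
    (headers and inter-frame spacing are ignored)."""
    return 8 * n_bytes / rate_mbps


FULL_BYTES, REPAIR_BYTES = 1500, 64  # hypothetical frame and repair sizes
BER = 1e-4                           # hypothetical bit error rate

# Rate too high: the shorter repair packet is more likely to arrive intact.
p_full = delivery_probability(FULL_BYTES, BER)
p_repair = delivery_probability(REPAIR_BYTES, BER)

# Rate too low: the shorter repair packet occupies the channel for less time.
t_full = payload_airtime_us(FULL_BYTES, 6)
t_repair = payload_airtime_us(REPAIR_BYTES, 6)
```

Under these assumptions the full retransmission survives roughly 30% of the time while the short repair survives about 95% of the time, and at 6 Mbps the repair payload needs well under a tenth of the airtime.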


Our implementation permits us to evaluate Maranello 
in terms of delivered throughput and latency in realistic 
settings. We compare the link throughput of Maranello 
and that of the original 802.11 in three different environ- 
ments: an industrial research lab, a home, and a cam- 
pus office building. We show that Maranello can sig- 
nificantly improve the delivered link throughput. We 
also verify that, even in the presence of bit corruption, 
Maranello can maintain or reduce the link latency, in 
terms of the time to deliver an individual packet and re- 
ceive an acknowledgment. We also deploy Maranello 
on programmable access points running OpenWRT to 
ensure scalability and compatibility by associating both 
Maranello-enabled and unmodified 802.11 devices. Sur- 
prisingly, we find that ACK frames can be modified to 
report the feedback information of received blocks, with- 
out causing errors on coexisting unmodified 802.11 de- 
vices. 

In the following section, we present an overview of 
prior wireless error recovery mechanisms including
partial packet recovery schemes and those that rely on
wireless communication diversity. In Section 3, we present
the high-level design of Maranello, show how wireless 
errors cluster enough to support block-based recovery, 
and justify the choice of Fletcher-32. In Section 4, we 
evaluate these design choices in simulation, showing the 
potential throughput gains by interpreting detailed packet 
traces. In Section 5, we implement Maranello using 
the OpenFWWF firmware and a slightly modified driver
within the Linux kernel. Section 6 presents performance 
comparisons collected in our testbeds using this imple- 
mentation. We offer a discussion in Section 7 and con- 
clude in Section 8. 


2 Related Work 


In this section, we classify various wireless error recovery
protocols, which are summarized in Table 1.
We categorize these protocols along two
dimensions: the main repair techniques that they employ 
and the features they provide. The main repair tech- 
niques include: 


Block checksum (Section 2.1) When transmissions fail, 
receivers can aid recovery by sending feedback 
about corrupted blocks based on the per-block 
checksums transmitted with data packets. Seda [6] 
and FRJ [11] are protocols in this category. 

Forward error correction (Section 2.2) Protocols like 
ZipTx [14] avoid explicit knowledge about where 
the error bits are and adaptively transmit error cor- 
rection bits that are likely to be sufficient to repair 
corrupted packets. 

PHY layer hints (Section 2.3) The PHY layer of GNU 
Radio systems can provide the confidence of each 
symbol’s correctness. PPR [12] and SOFT [27] ben- 
efit from this information to identify corrupt bits 
without extra error detection codes. 

Wireless communication diversity (Section 2.4) Wire- 
less packet losses are path and location dependent. 
A packet corrupted at its destination may be cor- 
rectly received by other radios, due to the broad- 
cast nature and diversity of wireless communica- 
tion. Several protocols exploit this diversity to per- 
form error recovery, such as MRD [20], SPaC [4], 
and PRO [16]. 


These error recovery protocols provide the following 
features: 


No extra bits for correct packets Most of the proto- 
cols introduce no additional bits for successful 
transmissions, except Seda and FRJ, which trans- 
mit block/segment checksums with all packets, and 
ZipTx, which sends pilot bits in each transmission.

                         No extra bits        Maintain      Compatible   Incremental  Partial Packet
Technique   Protocol     for correct packets  link latency  with 802.11  deployment   Recovery
            Maranello    ✓ ✓ ✓ ✓ ✓
Checksum    Seda [6]     N/A ✓ ✓
            FRJ [11]     ✓ ✓
FEC         ZipTx [14]   ✓ ✓
PHY layer   PPR [12]     ✓ ✓ N/A ✓
hints       SOFT [27]    ✓ ✓ N/A
            MRD [20]     ✓ ✓ ✓
Diversity   SPaC [4]     ✓ N/A N/A ✓
            PRO [16]     ✓ ✓ ✓ ✓


Table 1: Desired behavior and functionality of wireless error recovery protocols

Reduce recovery latency Seda, FRJ, and ZipTx may
increase the recovery latency by aggregating feed- 
back for a group of corrupted packets. MRD and 
SOFT may also increase the recovery latency for the 
packets that cannot be repaired by frame combining. 

Compatible with 802.11 Among the protocols de- 
signed for 802.11 wireless networks, MRD, FRJ, 
and ZipTx disable the retransmission protocol at the 
MAC layer and thus do not interoperate with native 
802.11. 

Incremental deployment Most of the protocols are 
implemented using commercial hardware, either 
802.11 cards or MICA motes, and thus can be in- 
crementally deployed on widely available wireless 
devices. In contrast, PPR and SOFT use physical 
layer information provided by GNU Radio systems. 

Partial packet recovery Protocols like PRO and SOFT 
always retransmit the entire packet when the origi- 
nal cannot be recovered. 


Table 1 shows that none of these prior protocols achieves
all these features simultaneously.


2.1 Block Checksum 


Acknowledgment frames can be extended to include 
feedback to help error recovery protocols. Seda [6] is 
a recovery mechanism designed for data streaming in 
wireless sensor networks. In Seda, a sender divides each 
packet into blocks and encodes each block with a one- 
byte sequence number and a (one-byte) CRC-8 for er- 
ror detection. A receiver, after receiving several pack- 
ets, will test the block-level CRC-8’s for packets that fail 
the CRC-32 (if any) and request retransmission of those 
blocks. 


FRJ [11] uses jumbo frames to increase wireless link 
capacity. Each jumbo frame comprises 30 segments and 
each segment has its own CRC checksum. The receivers 
can check these segment checksums to perform partial 
retransmissions when the segments are corrupted. FRJ
uses both MAC-layer ACKs and its own ACKs. FRJ
sends its own ACKs after 100 ms or 64 received frames.

Unlike Seda and FRJ, Maranello introduces no extra 
bits for correctly received frames and performs retrans- 
mission immediately after corrupted frames are detected. 


2.2 Forward Error Correction 


Forward error correction codes are beneficial to error 
recovery because they do not require explicit informa- 
tion about error locations. ZipTx [14] uses a two-round 
forward error correction mechanism to repair corrupted 
packets. In the first round, the transmitter sends a small 
number of Reed-Solomon bits for a corrupted packet, 
based on the feedback provided by the receiver. If the 
receiver still cannot recover the corrupted packets using 
these parity bits, the transmitter sends more parity bits 
in the second round. If both rounds fail, the receiver re- 
quests a retransmission of the whole packet. To reduce 
the number of feedback frames, ZipTx receivers accu- 
mulate feedback information to be transmitted after re- 
ceiving eight packets or after a timeout. 

Although ZipTx increases throughput, it may also in- 
crease recovery latency. This is because it disables MAC 
layer retransmission and generates its own ACKs for a 
group of packets in the driver. As a result, the delay 
for the recovered packets may be significantly higher 
than that of the retransmitted native 802.11 packets. 
Maranello repairs corrupted packets immediately after 
transmission fails and thus can reduce recovery latency. 


2.3 PHY Layer Hints 


Error recovery protocols can benefit from physical layer 
information beyond the best guess at the received sym- 
bol, although most commercial 802.11 cards do not ex- 
pose such extra information. PPR [12] requests retrans- 
missions of only those symbols that are likely corrupted. 
PPR also provides a compact encoding of the ranges 
of bits requested for retransmission and replicates the 
preamble to a “postamble” so that receivers may recover
correct bits at the end of packets that lack a good preamble.
PPR was implemented and evaluated on an 802.15.4
(ZigBee) protocol stack. 


Driven by per-bit confidence from the PHY Layer, 
SOFT [27] combines several received versions of a cor- 
rupted frame to produce a correct frame. To repair pack- 
ets sent to an AP, several APs share bit confidence over 
a wired link. To repair packets sent to a client, the client 
combines per-bit confidence from a corrupted transmis- 
sion and one or more retransmissions. 


Due to performance limitations of software radio plat- 
forms, these protocols are evaluated only at low bit 
rates. In contrast, Maranello is implemented using read- 
ily available commercial 802.11 hardware, and thus it 
can be immediately realized at speed and deployed. 
We also show that Maranello provides increased perfor- 
mance even with the encodings used for high bit rates. 


2.4 Wireless Communication Diversity 


Correcting errors with wireless diversity complements 
Maranello’s packet repair. Diversity approaches attempt 
to correct packets by observing different copies of the 
same packet, either as received at different stations or 
as received in (corrupt) retransmissions. When failure 
happens, MRD [20] combines many received versions of 
a given packet at different APs, which may have error 
bits at different locations, to recreate the original packet. 
If the original packet cannot be recovered through frame 
combining, a retransmission protocol, called Request For 
Acknowledgment (RFA), is proposed to retransmit the 
whole packet. SPaC [4] exploits the spatial diversity 
of multihop wireless sensor networks to combine sev- 
eral corrupted receptions of a packet at its destination. 
These corrupted receptions may be retransmitted by dif- 
ferent neighboring nodes to repair the original transmis- 
sion. PRO [16] is an opportunistic retransmission pro- 
tocol for 802.11 wireless LANs that allows overhearing 
relay nodes to retransmit on behalf of the source node 
after they know that a transmission failed. 


Other protocols can benefit from wireless communi- 
cation diversity, but these are typically evaluated only 
by theoretical analysis or simulation study. For exam- 
ple, MRQ [24] keeps all the erroneous receptions of a 
given packet and recovers the original packet by com- 
bining these receptions. Like PRO, HARBINGER [28] 
improves the performance of Hybrid ARQ by exploiting
retransmitted packets from relays that overhear the
communication. The approach of Choi et al. [3] uses the error
correction bits transmitted in data packets to recover cor- 
rupted blocks. It retrieves uncorrected blocks from later 
retransmissions of the packets and combines them with 
previous blocks to recover the original packets.


Figure 1: Maranello reacts to packet corruption by sending
a NACK when the sender awaits an ACK. The time
to repair should decrease relative to retransmission.
(Diagram not to scale.)


3 Maranello Design 


In this section, we present an overview of Maranello, de- 
scribe how it achieves the key design goals of a practical 
partial packet recovery scheme, and justify the choices 
of block-based recovery and the Fletcher-32 checksum 
computation. We analyze this design in isolation in the 
following section (Section 4) before presenting implementation
details (Section 5) and evaluating the implementation on 
real hardware. 


3.1 Overview 


Figure 1 presents an overview of the Maranello protocol,
compared to 802.11. When a Maranello-supporting
device receives a frame with errors, it divides the frame 
into 64-byte blocks (the last block may be smaller) and 
computes a separate checksum for each block. Then 
it replies to the transmitter with a NACK that includes 
these checksums. It saves the corrupted original packet 
in a buffer, waiting for the sender to transmit correct 
blocks. This negative acknowledgment is sent when 
the transmitter expects to receive a positive acknowl- 
edgment. A Maranello-supporting transmitter will then 
match the receiver-supplied checksums to those of the 
original transmission and send a repair packet with only 
those blocks of the original transmission that were cor- 
rupted. Once the repair packet is received correctly, the 
receiver sends a normal 802.11 ACK. 
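
The exchange above can be illustrated with a minimal sketch. It is not the firmware implementation: the function names are hypothetical, and CRC-32 from Python's zlib module is used only as a stand-in for the per-block checksum (Section 3.4 explains why Maranello uses Fletcher-32 instead).

```python
import zlib

BLOCK_SIZE = 64  # bytes, as in the text; the last block may be shorter


def split_blocks(frame: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide a frame into fixed-size blocks (the last one may be smaller)."""
    return [frame[i:i + block_size] for i in range(0, len(frame), block_size)]


def block_checksums(frame: bytes) -> list[int]:
    """One 32-bit checksum per block; CRC-32 is only a stand-in here."""
    return [zlib.crc32(block) for block in split_blocks(frame)]


def blocks_to_repair(original: bytes, nack: list[int]) -> list[int]:
    """Sender side: indices of blocks whose checksums differ from the NACK."""
    mine = block_checksums(original)
    return [i for i, (a, b) in enumerate(zip(mine, nack)) if a != b]


original = bytes(range(256)) * 2           # 512-byte frame, 8 blocks
received = bytearray(original)
received[70] ^= 0x01                       # one bit error in block 1
received[300] ^= 0x80                      # one bit error in block 4
nack = block_checksums(bytes(received))    # checksums the receiver would send
damaged = blocks_to_repair(original, nack)
print(damaged)  # [1, 4]
```

Only the two damaged blocks (128 bytes) would need to be resent here, rather than the full 512-byte frame.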

Devices that do not support Maranello interoperate 
easily. Unmodified senders will treat the negative ac- 
knowledgment as garbage and retransmit as normal. 
Unmodified receivers will fail to transmit a Maranello 
NACK, and cause a Maranello sender to retransmit after 
timeout. 

At very low transmission rates, the NACK for a large
packet may be longer than other stations expect to defer
to the acknowledgment (i.e., it may extend beyond the
Network Allocation Vector); if it does, we rely on carrier
sense to inhibit collisions with the end of the NACK.

The cases in which a Maranello-specific packet is lost
are straightforward. If a NACK is lost, the transmitter
will retransmit the packet as in 802.11. If this retransmission
has errors, the receiver will send another NACK. If a
repair packet is lost or received with errors, the receiver
will transmit nothing. One could alter the protocol to
send an abridged NACK to recover correct blocks from
errored repair packets, but we expect minimal gain from
the added complexity.


3.2 Design Goals 


Maranello is a practical partial packet recovery design 
with four primary goals, described below. 


Require no extra bits in correct packets Maranello embraces
the systems design principles of optimizing the common
case (successful transmission) and doing no harm
(not increasing the size or delay of retransmissions). No
additional error checking information, beyond the existing
CRC-32, is added to normal packets.


Reduce recovery latency Maranello ensures that recovery
latency is smaller than retransmission time by sending
negative acknowledgments, when a positive acknowledgment
is not warranted, in the time reserved for positive
acknowledgments.

(In the unlikely event that the entire packet is corrupt,
the longer NACK may require more time than an
ACK and the retransmission of the entire packet may not be
avoided, leading to an overall increase in retransmission
time.)


Compatibility with existing 802.11 802.11 is widely 
deployed, cheap, and useful. To extend it requires obe- 
dience to key inter-frame spacing and backoff require- 
ments. The receiver must be able to construct and send 
a NACK before the transmitter decides to retransmit the 
entire packet, ideally immediately after the SIFS (short 
inter-frame space) interval when the transmitter expects 
an ACK. That is, the implementation must support ex- 
tremely quick computation of block checksums in order 
to respond to the sender. At the same time, a Maranello 
sender cannot send repair packets any more quickly than 
802.11 sends retransmissions: collisions are a potential 
cause of transmission error and must be addressed by 
proper exponential backoff. These two features are nec- 
essary for coexistence with 802.11 networks. 


Incremental deployability on existing hardware Wire- 
less networks are dynamic: Maranello should not require 
negotiation or, worse, ubiquitous deployment within 
a service area. By transmitting Maranello messages 
such that unmodified 802.11 devices are not confused, 
Maranello can coexist. In effect, the Maranello NACK
is a negotiation; a Maranello station may infer that the
receiver does not support Maranello if no NACKs are
sent. (Reserved bits in the capability-information field
of beacon and association-request frames are also available;
it is possible to negotiate protocol features when
necessary.) Further, by implementing Maranello in the
firmware of existing wireless cards, this partial packet
recovery protocol can be deployed today for users just
by updating the firmware.


Figure 2: Shaded areas indicate bit errors. Within-packet
(horizontal) correlations are likely due to interference or
loss of clock synchronization; across-packet (vertical)
correlations may be caused by subcarrier fading.


3.3 Block-Based Recovery


Broadly speaking, a partial packet recovery approach can 
use various means for receivers to solicit retransmission 
of parts of the packet and various means for transmit- 
ters to correct those errors. Maranello sends negative ac- 
knowledgments with checksums over blocks; transmit- 
ters determine which blocks must be retransmitted and 
send repair packets in place of retransmissions. (Alter- 
nate approaches may report abstract bit error estimates, 
request retransmission of individual symbols, or piggy- 
back repair on subsequent transmissions, as described in 
Section 2.) 

Block-based recovery, however, relies on a key as- 
sumption: that errors are clustered within a packet. In 
Figures 2 and 3, we present two views of error cluster- 
ing. Figure 2 shows the positions of bit errors in 100 
packets chosen at random from the errored packets in a 
larger trace of packets. For packets with few bit errors, 
those errors are constrained within 64-byte blocks. For 
packets with many bit errors, those errors are similarly 
often bound within consecutive 64-byte blocks. 

Figure 3 plots 17,961 packets by the number of 64-byte
blocks that would be needed to repair errors. The
X-axis represents the fraction of corrupt packets: each
packet occupies the same horizontal space along the axis,
sorted in ascending order of the number of bit errors observed
in that packet. A stacked bar graph extends above,
showing the fraction of those packets that require different
numbers of blocks. At the left side of the graph, the
dominant color represents the single block’s ability to repair
all 1-bit errors (of course), 99.7% of two-bit errors,
96% of three-bit errors, etc. This is in contrast to a random
bit-error model in which two bit errors in a 1500-byte
packet would have only a 4% chance of corrupting
only one 64-byte block. At the right end of the graph,
relatively few packets require complete retransmission.
(This graph may underestimate the number of irreparable
transmissions; those that the hardware cannot receive
at all would not appear.)


Figure 3: 64-byte blocks required to repair corrupt packets
in a trace. Most packets having bit errors have few
corrupt blocks; even those with many bit errors typically
have a few correct blocks.
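
The 4% figure for the random bit-error model can be checked with a short calculation. The sketch below assumes the two error positions are distinct and chosen uniformly at random from a 1500-byte packet split into 64-byte blocks (23 full blocks plus one 28-byte block); the function name is ours.

```python
from math import comb


def same_block_probability(packet_bytes: int = 1500, block_bytes: int = 64) -> float:
    """Chance that two uniformly random, distinct bit errors share a block."""
    total_bits = packet_bytes * 8
    full, rest = divmod(packet_bytes, block_bytes)
    # Block sizes in bits: `full` complete blocks plus an optional short one.
    sizes = [block_bytes * 8] * full + ([rest * 8] if rest else [])
    same = sum(comb(n, 2) for n in sizes)   # error pairs landing in one block
    return same / comb(total_bits, 2)       # out of all possible error pairs


p = same_block_probability()
print(f"{p:.3f}")  # 0.042
```

The result, roughly 4.2%, is far below the 99.7% observed for two-bit errors in the trace, which is the clustering that block-based recovery relies on.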


3.4 Fletcher-32 


The block checksums a receiver puts into a NACK 
must be completely computed before the SIFS expires. 
One approach might be to reprogram the hardware- 
accelerated CRC-32 engine used by the device to com- 
pute whole-packet CRCs. Unfortunately, this engine 
does not appear to be programmable. Instead, we compute
a different checksum, the Fletcher-32 [5], which is
more efficiently computed on the wireless card’s
microprocessor. Historically, the IETF considered Fletcher
checksums as an alternative for TCP checksums [30]. 
To verify the effectiveness of Fletcher-32 in detecting bit
errors, we perform the following trace-driven simulation.
We take the 99,118 corrupted frames from a packet trace
and identify error bit positions in each frame. Then, we
apply the error patterns to randomly generated packet
contents to construct 9,911,800 errored packets. Finally,
we apply CRC-32 and Fletcher-32 to detect corrupted
64-byte blocks. All the corrupted blocks can
be detected by both CRC-32 and Fletcher-32.

Even with the efficient Fletcher-32 checksum, the microprocessor
is still not powerful enough to compute
each of the block checksums during the SIFS interval:
a single checksum for a 64-byte block can take up to 4
μs. To solve this problem, we exploit an interesting feature
of the chipset. The microprocessor, in fact, is idle
during the reception of a frame! Instead of allowing it to
sleep until the packet is completely received, we modify
the firmware to copy partially received packets and begin
computation of block checksums during reception of
the next block. This approach leaves enough time at the
end of a corrupted frame to compute the last checksum
(if needed) and to build the NACK.
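
To make the idea concrete, the following sketch computes Fletcher-32 and checksums each block as soon as it has fully arrived, mirroring the overlap of computation with reception. It is an illustration in Python rather than the firmware's assembly; the word order (little-endian 16-bit words), the zero padding of odd-length input, and the class name are assumptions not stated in the text.

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 over little-endian 16-bit words (zero-padded if odd)."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1


class StreamingBlockChecksums:
    """Checksum each 64-byte block as soon as all of its bytes have arrived."""

    def __init__(self, block_size: int = 64):
        self.block_size = block_size
        self.pending = bytearray()
        self.checksums: list[int] = []

    def feed(self, chunk: bytes) -> None:
        """Accept newly received bytes and checksum any completed blocks."""
        self.pending += chunk
        while len(self.pending) >= self.block_size:
            block = bytes(self.pending[:self.block_size])
            del self.pending[:self.block_size]
            self.checksums.append(fletcher32(block))

    def finish(self) -> list[int]:
        """At end of frame, only a short final block (if any) remains."""
        if self.pending:
            self.checksums.append(fletcher32(bytes(self.pending)))
            self.pending.clear()
        return self.checksums


frame = bytes(i % 251 for i in range(200))
stream = StreamingBlockChecksums()
for offset in range(0, len(frame), 48):      # bytes trickle in 48 at a time
    stream.feed(frame[offset:offset + 48])
streamed = stream.finish()
print(len(streamed), hex(fletcher32(b"abcde")))  # 4 0xf04fc729
```

By the time the frame ends, at most one block checksum is outstanding, which is what lets the remaining work fit before the NACK must be sent.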


4 Simulation 


Before we describe and evaluate the implementation, 
we evaluate the design of Maranello in simulation. 
Maranello’s gains depend on the specified, but not 
always followed, 802.11 backoff and the unspecified 
retransmission rate fallback behavior implemented in 
802.11 drivers and chipsets. We want to see if Maranello 
improves throughput for cards (we consider both the 
manufacturer’s driver and chipset) that behave unlike 
Broadcom’s, which we implemented Maranello on. 
Each card implements a different suite of error control 
algorithms, including auto-rate selection, retransmission 
rate fallback, and backoff. 802.11’s backoff behavior
is defined in the specification; however, our observations
and those of Bianchi et al. [2] indicate that there are many
different interpretations of 802.11 backoff. Although the
802.11 specification dictates backoff behavior, it leaves
implementors to decide on auto-rate selection and retransmission
rate fallback. 802.11 does not contain definitions
for these algorithms because no algorithm will
work in all wireless environments. For example, an optimistic
auto-rate selection may yield higher throughput
on some links, but may also result in many errors on others.
Our simulated results indicate Maranello can help
increase the throughput from optimistic rate selection.


4.1 Maranello Increases Throughput for 
Popular 802.11 Cards 


To characterize a variety of 802.11 backoff and retransmission
rate fallback policies, we observed the retransmissions
sent by three popular 802.11 cards. We ran
the cards on Windows XP to observe the behavior of the
most common driver. To analyze many instances of the
card retransmitting its maximum number of retransmissions,
we prevented the receiver from sending any acknowledgments.
For each card, Figure 4 depicts the median
inter-retransmission delay and time to transmit for
the observed retransmission rate fallback. For backoff,
some cards appear to follow 802.11: Intel and Broadcom’s
median interval between retransmissions doubles
for each retransmission. We did not observe Atheros
doubling the backoff window after failed retransmissions.


Figure 4: Popular 802.11 cards exhibit different exponential
backoff behavior (top) and retransmission rate fallback (x
labels show the rate, bottom bar shows transmission duration).


Retransmission rate fallback also varies between 
cards. Each card appears to attempt a different num- 
ber of retransmission rates (Intel 4, Atheros 4, Broad- 
com 2). Atheros does not experience much loss because 
the card will eventually attempt to retransmit a packet 
at the lowest possible rate defined in 802.11. Maranello 
helps Atheros because it will increase the probability 
that transmission is successful in the first few retransmis- 
sions, eliminating or at least reducing the size of retrans- 
missions sent at the lowest bit rate. Intel retransmits at 
optimistic rates so it may need to retransmit more times 
than a card that quickly lowers the retransmission rate. 
For Intel, Maranello will help because it increases the 
probability of receiving a retransmission correctly, re- 
warding optimistic retransmission rate selection. 


4.2 Trace-Driven Simulation 


A trace-driven simulation of Maranello indicates that 
successfully retransmitting earlier increases throughput 
for several interpretations of 802.11. The simulator oper- 
ates on a trace of packets with known payloads. Knowl- 
edge of the payload provides several desirable properties: 
(1) The simulator can determine the number of corrupted 
blocks in a packet. (2) The simulator can determine if 
the repair blocks fit inside a contiguous region of correct 
bits at the beginning of a (potentially corrupted) retrans- 
mission packet. (3) Resulting from (2) the simulator can 
subtract excess retransmissions seen after a successful 
repair. Table 2 shows the speedup obtained from simulating
Maranello for the three popular cards. Intel appears
to achieve significant gain because Maranello mitigates
the errors caused by retransmitting at an optimistic rate,
avoiding long, although standard, backoff times.


card       avg throughput   avg speedup   avg ERR   avg rate
Atheros          8.64           1.05        0.03      15.85
Broadcom        11.92           1.05        0.05      40.98
Intel            8.14           1.17        0.06      33.09


Table 2: In simulation Maranello increases throughput
for the Intel chipset by correcting errors caused by optimistic
behavior. ERR is the 64-byte block error rate.


4.3 Repair Size 


Compared to other partial packet recovery protocols, 
Maranello does not need to send significantly larger re- 
pair packets. We simulated each of the repair protocols 
(Figure 5) with traces of data packets sent from a Broad- 
com card. To vary the bit error rate of these traces, we 
changed the distance between the sender and the receiver. 
The symbol size (1-216 bits) for symbol based repair 
(PPR) corresponds to the packet bit rate. We simulated 
an ideal version of ZipTx that assumes the indexes of cor- 
rupted bits are known, so it can pick the smallest number 
of redundancy bytes for the repair. 

To repair corrupted bits, all of the repair protocols 
must send significantly more repair bits. For traces with a 
low BER, Maranello requires marginally more bits than 
the other repair protocols. ZipTx is able to retransmit
so few bits because Reed-Solomon works well when
there are few bit errors. For corrupted packets with a
high average BER, PPR’s symbol-based repair needs to
transmit the fewest bits to repair the packets.
However, symbol-based repair requires additional hardware
to measure the confidence of symbols. Although
Maranello needs more bits than symbol-based repair, it
requires fewer bits than ZipTx. If the packet contains errors
clustered in one block, ZipTx’s Reed-Solomon will
waste many repair bits for correct blocks because ZipTx
chooses its coding rate based on the BER of the most
corrupted block.


Figure 5: To repair corrupted packets with a high average
BER, Maranello uses fewer repair bits than ZipTx.
For low-average-BER corrupted packets, Maranello uses
more repair bits than the other techniques. The bitrate
shown is the average rate chosen by minstrel.


5 Implementation 


We implement Maranello using the OpenFWWF [7] open
firmware and the b43 Linux device driver [1] for Broadcom
chipsets. In the following, we first discuss why several 
other potential platforms are not suitable for Maranello. 
We then present the implementation details of Maranello. 


5.1 Why Other Platforms Are Unsuitable 


To use the airtime reserved for ACK frames, receivers 
must construct and send NACK frames within SIFS,
which is the defined inter-frame space between data 
packets and ACK frames [9]. Commercial 802.11 
wireless NICs implement this time-critical operation in 
firmware or hardware. 


5.1.1 Driver space of 802.11 wireless NICs 


Recently, several wireless research platforms, such as 
FlexMAC [15] and SoftMAC [21], have been proposed 
to develop new MAC protocols. They are extensions of 
the MadWifi driver [18] for Atheros chipsets which runs 
in Linux kernel space. To determine how fast an imple- 
mentation in driver space can send back NACK frames 
for corrupted frames, we perform the following exper- 
iment. When the test receiver gets a corrupted packet, 
it copies the first 100 bytes directly into a NACK frame, 
and sends it out immediately without performing backoff 
and using SIFS. From packets traced by a monitor node, 
we found that the minimum gap between the data packets
and NACK frames is higher than 70 μs. This delay is
mainly caused by bus transfer delay and interrupt latency
and is consistent with the measurement results in Lu et
al. [15]. This high latency makes the driver space unsuitable
for the implementation of Maranello. Jitter due to
DMA transfers makes timing too variable.


5.1.2 GNU Radio


GNU Radio platforms are slow and expensive. However, 
due to their flexibility, they have attracted increasing at- 
tention from the wireless research community and there 
are 802.11 implementations for them [22, 26]. In GNU 
Radio, the wireless signal is decoded at the host machine 
and the delay, depending on the length of the packets, 
is usually higher than 1000 μs [22]. The decoder could
be put into the FPGA (Field-Programmable Gate Array) 
on the Universal Software Radio Peripheral (USRP), but 
the FPGA is much slower than the digital signal proces- 
sor on the wireless NICs. Moreover, another challenge is 
to generate NACK frames for corrupted packets within 
SIFS, which is difficult to implement on these platforms. 


5.2 Maranello Implementation 


We first briefly introduce OpenFWWF and review the
architecture of wireless device drivers in the Linux kernel.
Then we present the implementation of Maranello, focusing
on NACK generation and repair packet construction,
which are time-critical operations implemented in
the firmware. Finally, we describe other operations
implemented in the Linux driver.


5.2.1 Background 


A microprocessor executes a typically proprietary mi- 
crocode (firmware), written in assembly language, that 
handles various operations on wireless cards. OpenFWWF
[7] attempts to replace the proprietary firmware
with an open source firmware for Broadcom chipsets. 
It can support almost all the 802.11 primitives in the 
2.4GHz frequency band. By changing the standard 
code path, it is possible to implement from scratch a 
completely different channel access mechanism, subject 
to a few basic hardware constraints, such as the PHY 
layer carrier sensing, the CCK and OFDM modulation 
schemes. 

To better understand how the Maranello implementation
works, we briefly review the basic
building blocks of the Broadcom chipset. The internal
microprocessor drives the data exchange between
different blocks using two main paths: transmit (TX)
and receive (RX). The firmware is built as a main loop
that reacts to external conditions such as a new frame’s
arrival from the air, a channel free indicator, and (programmable)
timer expiration. The basic building blocks
include:


TX and RX FIFO queues — The microprocessor pulls 
frames from the TX queue and moves them into the 
serializer when a transmission opportunity comes. 
On the opposite path, it moves a received frame 
from a buffer into the RX queue and raises an IRQ 
so that the host kernel can retrieve the frame. 


Internal shared memory (SHM) — The microproces- 
sor maintains several state variables which can be 
monitored or even changed by the host kernel. 

Template RAM — The microprocessor can compose an 
arbitrary frame in this memory and transmit the re- 
sulting packet as if it came from the TX FIFO. 

Internal registers and external conditions (EC) — 
The microprocessor sets these hardware registers in 
response to changes in the EC to program the radio 
interface and set up timers. 


The current Linux kernel uses mac80211 [17] for de- 
vice driver development. mac80211 is an abstraction 
layer that bridges between the kernel’s networking stack 
and almost all the low-level wireless device drivers. For 
example, the rate control algorithms are usually imple- 
mented in mac80211 and shared by all the drivers. These 
drivers then act as a stage-two bridge since all the 802.11
low level operations, such as retransmissions, acknowledgments,
and virtual carrier sense, must be performed
by either firmware or hardware, due to hard timing constraints
that cannot be met by a host-controlled approach.


5.2.2 NACK generation 


As we mentioned in Section 3, to compose the NACK, 
the receiver computes block checksums for corrupted 
frames in the firmware. Due to hardware limitations, the 
Maranello block size should be a multiple of 32 bytes. 
We use 64 bytes as the block size. Longer blocks in- 
crease computation efficiency and shorten NACKs, while 
shorter blocks are parsimonious with repair bytes. In our 
experience, the 64-byte block represents a good compro- 
mise at typical rates, though we discuss possibilities for 
dynamic adjustment in Section 7. 


For some transmission rates, a Maranello NACK uses 
more airtime than a MAC ACK, which may cause prob- 
lems in the presence of hidden terminals. The size of an 
ACK frame is only 14 bytes. A Maranello NACK frame, 
based on 64-byte blocks, is at most 96 bytes longer than 
an ACK frame (4-byte checksum for each block, 24 
blocks maximum). For a Maranello link, a hidden termi- 
nal of the receiver may hear from the transmitter the net- 
work allocation vector (NAV) and the earliest time it can 
start its own transmission is DIFS, 50 μs for 802.11b/g,
after the end of NAV (suppose its backoff time is 0). 
There will be no collision when NACK’s bit rate is higher 
than 12 Mbps. Otherwise, a transmission from the hid- 
den terminal may collide with our NACK frames, which 
causes the retransmission of the whole packet. Preliminary
experiments on a hidden terminal topology indicate
that even in this scenario, enabling Maranello can
increase overall throughput. 
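
The 12 Mbps threshold can be sanity-checked with a rough calculation. The sketch below is a simplified model, not the derivation used for the claim: it assumes the NAV covers exactly an ACK sent at the NACK's bit rate, so that only the 96 extra bytes spill past the NAV, and it ignores preambles and OFDM symbol padding. The rate list is the standard 802.11g OFDM rate set.

```python
DIFS_US = 50.0          # DIFS for 802.11b/g, from the text
EXTRA_BYTES = 24 * 4    # 24 blocks x 4-byte checksum = 96 bytes beyond an ACK

OFDM_RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]


def extra_airtime_us(rate_mbps: float, extra_bytes: int = EXTRA_BYTES) -> float:
    """Airtime of the extra NACK bytes; bits / (Mbit/s) gives microseconds."""
    return extra_bytes * 8 / rate_mbps


# Rates at which the extra bytes finish before a hidden terminal with zero
# backoff could start transmitting (DIFS after the end of the NAV).
safe = [r for r in OFDM_RATES_MBPS if extra_airtime_us(r) < DIFS_US]
print(safe)  # [18, 24, 36, 48, 54]
```

Under these assumptions the extra bytes take 64 μs at 12 Mbps but under 43 μs at 18 Mbps, consistent with the statement that collisions are avoided only above 12 Mbps.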


5.2.3 Repair packet construction


Maranello transmitters must handle both ACK and 
NACK frames. 


• Like 802.11, after a transmitter sends an original
data packet or a recovery packet, it will set up an
ACK timer.

• If the transmitter gets an ACK frame from the receiver,
it will release the resources allocated for the
original or recovery packets.

• If the transmitter gets a NACK frame from the receiver,
it divides the original packet into blocks,
computes the checksums for these blocks, and only
retransmits the blocks whose checksums do not
match those in the NACK. In practice, the block
checksums are precomputed in the driver on the
host processor.

• If the transmitter’s ACK timer expires without a
frame being received and it previously attempted
to repair the packet, it retransmits the repair packet.
Otherwise, it retransmits the whole packet.
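
The four cases above amount to a small decision table. The sketch below restates them in code; the enum and function names are hypothetical, and the retry limit and backoff are deliberately left out.

```python
from enum import Enum, auto


class Event(Enum):
    ACK = auto()       # a normal 802.11 ACK arrived
    NACK = auto()      # a Maranello NACK arrived
    TIMEOUT = auto()   # the ACK timer expired with no frame received


class Action(Enum):
    RELEASE = auto()         # free the resources held for the packet
    SEND_REPAIR = auto()     # send only the mismatching blocks
    RESEND_REPAIR = auto()   # retransmit the previous repair packet
    RESEND_WHOLE = auto()    # retransmit the whole original packet


def next_action(event: Event, repair_attempted: bool) -> Action:
    """Transmitter reaction after sending an original or repair packet."""
    if event is Event.ACK:
        return Action.RELEASE
    if event is Event.NACK:
        return Action.SEND_REPAIR
    # Timeout: repeat the repair if one was already attempted.
    return Action.RESEND_REPAIR if repair_attempted else Action.RESEND_WHOLE


print(next_action(Event.TIMEOUT, repair_attempted=True).name)  # RESEND_REPAIR
```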


After a transmitter gets a NACK, it compares the 
received block checksums with the locally computed 
checksums and decides which blocks to retransmit inside
a repair packet. We always retransmit the first block of 
a packet, which contains the important headers of vari- 
ous layers. For a repair packet, we reuse the 8-byte LLC 
header, only for data frames, by (1) changing the first 
byte to distinguish repair packets from other packets; (2) 
using the following 3 bytes as a bitmap of retransmitted 
blocks; and (3) appending an extra checksum (CRC-32 
or Fletcher-32) in the last four bytes. The receiver uses 
this checksum, as an extra measure of safety, to verify 
that the recovered packet is correct. 
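
One possible layout of such a repair packet is sketched below. Several details are assumptions made for illustration only: the marker value 0xFF, the bit order of the bitmap (bit i stands for block i), the byte order of the fields, the use of CRC-32 over the whole payload as the extra checksum, and treating the packet as a flat payload rather than modelling the LLC header itself.

```python
import zlib

BLOCK_SIZE = 64        # bytes per block, from the text
MAX_BLOCKS = 24        # a 3-byte bitmap has room for 24 blocks
REPAIR_MARKER = 0xFF   # hypothetical first-byte value marking a repair packet


def build_repair_packet(payload: bytes, mismatched: set[int]) -> bytes:
    """Return an 8-byte header (marker, bitmap, checksum) plus chosen blocks."""
    blocks = [payload[i:i + BLOCK_SIZE] for i in range(0, len(payload), BLOCK_SIZE)]
    if not blocks or len(blocks) > MAX_BLOCKS:
        raise ValueError("payload must span between 1 and 24 blocks")
    chosen = sorted(set(mismatched) | {0})   # the first block is always resent
    if chosen[-1] >= len(blocks):
        raise ValueError("block index out of range")
    bitmap = 0
    for index in chosen:
        bitmap |= 1 << index                 # assumed order: bit i <-> block i
    header = (bytes([REPAIR_MARKER])
              + bitmap.to_bytes(3, "big")
              + zlib.crc32(payload).to_bytes(4, "big"))
    return header + b"".join(blocks[i] for i in chosen)


payload = bytes(i % 251 for i in range(300))   # 5 blocks: 4 full, 1 of 44 bytes
repair = build_repair_packet(payload, {3})
print(repair[1:4].hex(), len(repair))          # 000009 136
```

Here blocks 0 and 3 are resent, so the repair packet carries 128 bytes of data plus the 8-byte header instead of the 300-byte original.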

Maranello uses the same 802.11 retry limit; each repair
packet will increase the retry counter by one. Also,
before transmitting repair packets, it doubles the contention
window.


5.2.4 Driver functionality 


We implement non-time-critical operations in the driver, 
including the pre-computation of block checksums at 
the transmitter, and the reconstruction of frames at the 
receiver. We compute the block checksums for data 
packets in the driver, because the CPU on the host ma- 
chine is much more powerful than the microprocessor of 
the wireless card. Checksums are sent to the firmware 
with each data packet. After the transmitter receives a 
NACK frame, its firmware can use these checksums di- 
rectly, without recomputation. Checksums computed at 
the transmitter are used only to match those in the NACK 
frames and they are not transmitted. The receiver’s driver 
combines a buffered corrupt packet with a correct recovery
packet to reconstruct the original. Recovered packets
that cannot pass the extra Fletcher-32 checksum test are
discarded.
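
A minimal sketch of this reconstruction step follows. It assumes the repaired blocks have already been extracted from the recovery packet into a mapping from block index to block contents, and it uses CRC-32 from zlib as a stand-in for the extra whole-packet checksum; the function name is hypothetical.

```python
import zlib

BLOCK_SIZE = 64  # bytes per block, from the text


def reconstruct(corrupt: bytes, repaired: dict[int, bytes], expected_crc: int):
    """Patch repaired blocks into the buffered frame; None if the check fails."""
    frame = bytearray(corrupt)
    for index, block in repaired.items():
        start = index * BLOCK_SIZE
        frame[start:start + len(block)] = block
    candidate = bytes(frame)
    if len(candidate) != len(corrupt) or zlib.crc32(candidate) != expected_crc:
        return None          # the recovered packet is discarded
    return candidate


original = bytes(i % 251 for i in range(200))
corrupt = bytearray(original)
corrupt[10] ^= 0x04          # bit error in block 0
corrupt[130] ^= 0x20         # bit error in block 2
crc = zlib.crc32(original)
full = reconstruct(bytes(corrupt), {0: original[0:64], 2: original[128:192]}, crc)
partial = reconstruct(bytes(corrupt), {0: original[0:64]}, crc)
print(full == original, partial is None)  # True True
```

The second call shows why the extra check matters: if a damaged block is left unrepaired, the patched frame fails the whole-packet checksum and is dropped.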


6 Evaluation 


In this section, we evaluate the throughput and latency 
performance of Maranello in implementation, isolate the 
factors that reduce recovery time, and run Maranello 
alongside unmodified 802.11 senders to ensure cooper- 
ative interaction. 

We used 802.11b/g channels 1, 6, and 11 in environ- 
ments with active APs and stations. This experimental 
approach has the advantage of injecting real-world in- 
terference and collisions as sources of packet error, but 
has the disadvantage of reducing the repeatability of ex- 
periments since contention varies. We enable auto rate 
feedback for all of the experiments and use Linux “min- 
strel” [19] as the rate control algorithm, which supports 
multiple rate retries and is the only rate control algo- 
rithm enabled in the Linux kernel 2.6.28 and above. (Our 
driver implementation is in 2.6.29-rc2.) 


6.1 Maranello Increases Link Throughput 


In the following, we show that Maranello can increase 
throughput for UDP traffic. We construct testbeds in 
three different environments: an industry research lab, 
a home, and a university building. We run Iperf [10] on 
randomly selected links from these testbeds to generate 
a CBR UDP stream to saturate the wireless channel. We 
focus on UDP to isolate link capacity from TCP dynamics.

We compare the throughput of Maranello and unmod- 
ified 802.11 in Figure 6. In these plots, the x-axis rep- 
resents the throughput of 802.11 and the y-axis is the 
throughput of Maranello. Each point represents a pair 
of one-minute executions of Iperf, typically separated by 
less than 15 seconds. This separation is needed because 
we reload the firmware and driver, set up wireless inter- 
faces, and initialize minstrel’s bit rate table. Each point 
belongs to a group of ten points collected from randomly 
selected sender and receiver locations. In other words, 
we collected ten points, moved the receiver or sender sta- 
tion to another location, collected ten points again, and 
repeated. These figures include 370 (industry research 
lab), 390 (home), and 1000 (university building) points. 
The position of a point indicates the apparent through- 
put gain. For example, if a point is on the line marked 
“2X”, the throughput gain is 2. We divide the points into 
5 regions based on their throughput gain and show the 
percentage of points in each region in these figures. A 
point on a line is counted in the region above that line. 
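The region-counting rule above can be sketched as follows. This is a minimal illustration, not the analysis script used for Figure 6: the boundary values in `BOUNDARIES` are hypothetical (the text names only the "2X" line), and the function names are ours.

```python
from bisect import bisect_right

# Hypothetical gain boundaries separating the 5 regions; only the "2X"
# line is named in the text, the other values are assumptions.
BOUNDARIES = [1.0, 1.3, 1.5, 2.0]


def gain_region(tput_80211, tput_maranello, boundaries=BOUNDARIES):
    """Return the index (0 = lowest gain) of the region a point falls in."""
    gain = tput_maranello / tput_80211
    # bisect_right puts a point lying exactly on a line into the
    # region above that line, matching the counting rule in the text.
    return bisect_right(boundaries, gain)


def region_percentages(points, boundaries=BOUNDARIES):
    """points: list of (802.11 throughput, Maranello throughput) pairs."""
    counts = [0] * (len(boundaries) + 1)
    for base, new in points:
        counts[gain_region(base, new, boundaries)] += 1
    return [100.0 * c / len(points) for c in counts]
```

With these assumed boundaries, a point at (10, 20) lies on the "2X" line and is counted in the top region.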

Figure 6 shows that Maranello can increase the 
throughput for UDP traffic; often by 30% or more. The 
university building environment shows higher through- 
put gain, because of increased contention and poorer 
channel conditions, than those observed in the other en- 
vironments. There are more than 10 access points de- 
ployed for each of the 802.11b/g channels, 1, 6, and 11, 
and they are used by many people. For the other environ- 
ments, each channel usually has fewer than four access 
points and relatively few users. To estimate the variabil- 
ity in the measurement of throughput over adjacent inter- 
vals, we also compare the throughput of 802.11 with it- 
self. We pair the throughput of two consecutive runs with 
802.11 into a point. Figure 6(d) shows the results for ex- 
periments done in our office building. The uncertainty in 
the throughput of adjacent measurements of unmodified 
802.11 appears comparable to those of measurements be- 
tween 802.11 and Maranello. Put simply, Maranello does 
not appear to increase the variability in throughput per- 
formance. 


6.2 Maranello Reduces Recovery Latency 


We define latency in this context to be the interval be- 
tween when the firmware fetches the pending packet 
from the head of the TX FIFO to when an ACK is re- 
ceived. This includes the time spent inhibited by car- 
rier sense, waiting for a transmission opportunity, and 
represents the time that the device is occupied with the 
transmission of an individual frame. We randomly select 
a link, then run Iperf for one minute for Maranello and 
802.11 separately to get the per-packet latency. We use 
the firmware to record the measured time directly using 
the internal board clock: a 64-bit counter incremented 
every microsecond. 


One might consider alternate definitions of latency. 
One might ignore contention and backoff time required 
by CSMA/CA; even though the card is occupied in the 
process of transmitting a packet, no signal is yet being 
transmitted. Such would be appropriate for measuring 
peak performance. Alternately, one might consider the 
time to successful delivery and ignore cases when the 
ACK is lost; the transmitter, of course, is still occupied. 


We plot the CDF of latency for packets that need re- 
transmissions in Figure 7. To make the comparison clear, 
we omit the latency for packets without retransmission, 
and we plot the latency of only one configuration of 
sender and receiver locations (other configurations are 
qualitatively similar but not composable). Maranello 
can deliver 90% of the packets that need retransmission 
within a latency of 4.16 ms. In contrast, 10% of 802.11 
recovery latencies are above 17.1 ms. The small modes 
near 16 and 32 ms for 802.11 represent low-rate retrans- 
missions: The minstrel default retransmission rate fall- 
back attempts retransmissions at the original rate twice, 
followed by 1 Mbit/s up to four times if needed. 
Figure 6: Maranello has a higher throughput than 802.11. Each figure compares 802.11 with Maranello in a different 
environment, or, to show the uncertainty of the comparison, with 802.11 itself. Each point represents the performance 
of back-to-back one-minute UDP throughput measurements; ten points were collected for each configuration of sender 
and receiver stations. 

Figure 7: With block-based repair, Maranello recovers 
packets faster than 802.11’s retransmissions. 


6.3 The Sources of Throughput Gain and 
Latency Reduction 


To break down the sources of performance improvement, 
we enhance the transmission status report for each packet 
with the following information: (1) whether a repair 
packet was used, (2) if used, at which attempt, and (3) 
the number of retransmitted blocks in the repair packet. 
The original report also includes (1) whether the packet 
is successfully delivered, (2) the number of attempts, and (3) 
the bit rate used for the packet. With this information, 
we can calculate the delivery probability at each attempt, 
the transmission airtime and the number of transmitted 
bytes for each attempt. We run Iperf for one minute for 
10 randomly selected links and plot in Figure 8 the prob- 
ability of successful attempt for two retransmission rate 
fallback schemes: Linux “minstrel” fallback, which al- 
ways uses 1 Mbps as the fallback rate, and 2-step fallback, 
which drops the bit rate selected by minstrel for the ini- 
tial transmissions by 2 steps (if possible) and uses it as 
the fallback rate. The two-step fallback selection emulates 
the Broadcom driver for Windows XP (Section 4.1). In 
this figure, the x-axis is the transmission attempt. The retry 
limit of Broadcom cards is 7: one initial transmission and 
at most six retransmissions. The y-axis is the probability 
that an attempt succeeds. 
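The two fallback schemes can be sketched as the per-attempt rate sequence they produce. This is a minimal illustration under stated assumptions, not the firmware logic: it assumes that one "step" is one position in the sorted 802.11b/g rate list, that a rate with fewer than two lower rates falls back to the lowest rate, and that the last four of the seven attempts use the fallback rate. The names `RATES` and `retry_chain` are ours.

```python
# 802.11b/g bit rates in Mbit/s, ascending. Treating one "step" as one
# position in this list is an assumption of this sketch.
RATES = [1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, 54]


def retry_chain(initial_rate, scheme, retry_limit=7, fallback_attempts=4):
    """Return the bit rate used at each transmission attempt."""
    if scheme == "minstrel":
        fallback = RATES[0]  # always 1 Mbit/s
    elif scheme == "two_step":
        idx = RATES.index(initial_rate)
        # Drop two steps if possible; otherwise use the lowest rate
        # (this clamping is an assumption).
        fallback = RATES[max(idx - 2, 0)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    same_rate_attempts = retry_limit - fallback_attempts
    return [initial_rate] * same_rate_attempts + [fallback] * fallback_attempts
```

For an initial rate of 54 Mbit/s, the minstrel chain ends with four attempts at 1 Mbit/s, while the two-step chain ends with four attempts at 36 Mbit/s.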


Figure 8: Maranello can successfully retransmit a packet earlier than 802.11. Each line represents a link measured 
either with 802.11 or Maranello; the probability that Maranello’s recovery packets are delivered is typically higher. 

Figure 9: Maranello can use airtime more effectively for packet transmissions. Each line represents a link measured 
either with 802.11 or Maranello; Maranello spends more time transmitting bits not yet correctly received. 


Figure 8 shows that the probability of successful re- 
transmission for Maranello is usually higher than that of 
802.11. Because the retransmission rate fallback does 
not budge for the first two retransmissions, the proba- 
bility of successful retransmission can be thought of as 
the conditional probability that, given a packet (or two) 
recently failed to be delivered at the chosen rate, this 
next transmission at the same rate will be delivered. Not 
surprisingly, for 802.11, this probability descends more 
steeply than for Maranello. Maranello, in contrast, can 
send shorter repair packets, which are less likely to be 
corrupted [8], even at the original bit rate. 


The delivery probability increases at the fourth attempt 
because the firmware reduces the bit rate for the last 
four attempts. The successful attempt probabilities for 
the first three attempts are more important, because most 
packets can succeed at the first two retransmissions. The 
estimate of the delivery probability for the seventh at- 
tempt (after three previous attempts at 1 Mbit/s) is un- 
certain due to the dearth of data. For example, where the 
7th attempt shows a 0.0 delivery probability for Maranello, 
only one packet was transmitted seven times. Where the 
7th attempt shows a 1.0 delivery probability for 802.11, there 
were 5 packets transmitted 7 times and all succeeded at 
this last attempt. 
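The per-attempt delivery probability can be computed from the transmission status reports roughly as follows. This is a sketch assuming each report is reduced to a hypothetical `(delivered, attempts)` pair; the real report carries more fields, and the function name is ours.

```python
from collections import Counter


def attempt_success_probability(reports, retry_limit=7):
    """reports: iterable of (delivered: bool, attempts: int) pairs.

    Returns {attempt number: fraction of packets reaching that attempt
    that were delivered at it}. Attempts nobody reached are omitted.
    """
    reached = Counter()
    succeeded = Counter()
    for delivered, attempts in reports:
        for k in range(1, attempts + 1):
            reached[k] += 1
        if delivered:
            # A delivered packet succeeded at its final attempt.
            succeeded[attempts] += 1
    return {
        k: succeeded[k] / reached[k]
        for k in range(1, retry_limit + 1)
        if reached[k]
    }
```

The small-sample effect described above follows directly: if a single packet reaches the seventh attempt and fails, the estimate for that attempt is 0.0, whatever the true delivery probability is.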

We also plot the fraction of effective time for each 
transmission attempt in Figure 9. Effective time 
represents the time spent transmitting correct blocks; 
Maranello can use airtime more effectively, because the 
correct bits in corrupted packets may be combined with 
recovery packets to reconstruct the original packets and 
the transmission time of these correct bits is effective. 


6.4 Deployment on Access Points 


Figure 10: With two clients sending to an AP, on average, 
Maranello increases their individual and overall through- 
put. Error bars indicate min and max for five one-minute 
runs. 


To show that Maranello can increase overall network per- 
formance and does not interact poorly with unmodified 
802.11 devices, we deploy Maranello on Linksys wire- 
less routers running OpenWRT [23]. We associate two 
desktop stations, A and B, with the Maranello AP. We 
run four types of experiments: A and B both running 
Maranello, both running 802.11, A running Maranello 
and B running 802.11, and vice versa. We connect a 
third station, C, to the AP using Ethernet, to act as an 
Iperf server. We do not run the Iperf server on the AP 
directly due to its limited CPU power. During a single 
one-minute experiment, A and B send UDP packets to C 
as fast as they can. Although experimenting with down- 
link traffic might be more typical of access point use, in 
that situation, that AP would be the only transmitter and 
would not show how Maranello transmitters interact with 
unmodified 802.11 transmitters. 


Figure 10 plots the throughput of these two stations us- 
ing a stacked bar graph. There are two key notes. First, 
running Maranello does not decrease the performance of 
the unmodified 802.11 station. That is, Maranello does 
not “cheat” the existing station of throughput. Second, 
when both stations run Maranello, the throughput is sig- 
nificantly increased for both stations. An interesting ob- 
servation is that it appears not to help A or B to indi- 
vidually run Maranello when in contention. (The results 
in Section 6.1 imply that each station gains individually 
when running Maranello without a persistently compet- 
ing station.) We plan to investigate this surprising result 
that Maranello is more social than selfish when compet- 
ing with an unmodified station. 


7 Discussion 


In this section, we discuss how Maranello can be com- 
plementary with frame aggregation, which is used in 
802.11n, and how the block size affects the performance 
of Maranello. 


7.1 Frame Aggregation and Maranello are 
Complementary 


To increase throughput, 802.11n reduces the 802.11 pro- 
tocol overheads, such as interframe spacing, PHY layer 
headers and acknowledgment frames, by aggregating 
data packets into jumbo frames. Aggregated packets that 
are received incorrectly are indicated in a block acknowl- 
edgment which is sent back to the transmitter. The trans- 
mitter can then send a new chunk that contains only the 
corrupt packets. Even though only part of a packet may 
have errors in it, 802.11n frame aggregation must retrans- 
mit whole packets: correctly received bits are wasted. 

Frame combining can improve throughput, but it also 
significantly increases latency, as senders must wait to 
aggregate enough frames to fill a jumbo frame. Block 
acknowledgments provide a complementary aggregation 
of feedback for 802.11n, where ACKs may be buffered 
together and sent as a group, similarly increasing per- 
packet latency. Maranello is complementary with these 
frame aggregation techniques because by repairing cor- 
rupted aggregated packets, Maranello can further in- 
crease link throughput. 


7.2 Optimal Block Size 


The Maranello block size is 64 bytes, primarily because 
it is the smallest multiple of 32 that can be supported 
by hardware (Section 5.2.2). A larger block size would 
increase computation efficiency somewhat and shorten 
NACKs, which may be useful at low bit rates. When the 
error rate is low, however, larger blocks may lead to re- 
pair packets with unnecessary extra bytes, wasting chan- 
nel time. 
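The cost of larger blocks at low error rates can be illustrated with a small calculation. This sketch assumes a repair packet carries exactly the blocks that contain at least one corrupted byte and ignores NACK and header overhead; the function name and the example offsets are hypothetical.

```python
def repair_bytes(error_offsets, packet_len, block_size=64):
    """Bytes a repair packet must carry for the given corrupted offsets.

    error_offsets: byte offsets within the packet that were corrupted.
    Every block containing at least one corrupted byte is resent whole.
    """
    damaged_blocks = {offset // block_size for offset in error_offsets}
    total = 0
    for block in damaged_blocks:
        start = block * block_size
        # The final block of a packet may be shorter than block_size.
        total += min(block_size, packet_len - start)
    return total
```

For a 1500-byte packet with single-byte errors at offsets 10 and 700, 64-byte blocks require 128 repair bytes, whereas 512-byte blocks require 1024.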

We consider an interesting future direction of research 
to be dynamically adjusting the block size. The ideal 
block size may vary based on an estimate of wireless 
channel conditions and the bit rate chosen by the trans- 
mitter, which determines the bit rate of the acknowledg- 
ment and thus the transmission time of the NACK. When 
the NACK is transmitted at a low rate, it may be better 
for global throughput to keep NACK transmissions short 
than to be precise about the blocks in error. A similar 
tradeoff exists in the FEC systems between the coding 
rate of error correction bits and recovery efficiency. An- 
other approach to determine the optimal block size that 
we intend to explore is to use theoretical models of wire- 
less communication errors [13, 29]. 


8 Conclusion 


In this paper, we design, implement, and evaluate 
Maranello, a practical partial packet recovery protocol 
for 802.11 wireless networks. Maranello has the follow- 
ing features simultaneously: (a) it introduces no extra 
bits in correct transmissions, (b) it reduces recovery la- 
tency, except in rare cases, (c) it is compatible with the 
802.11 protocol, and (d) it can be incrementally deployed 
on widely available 802.11 devices. 

We implemented Maranello using OpenFWWF open 
source firmware. This implementation, and Maranello’s 
compatibility with 802.11, allowed us to test in three dif- 
ferent, live environments over heavily used 802.11b/g 
channels where contention and interference are realis- 
tic. We found significant throughput gains when running 
Maranello over 802.11 in consecutive intervals. We also 
installed Maranello on access points running OpenWRT 
to demonstrate that Maranello does not compete unfairly 
with unmodified 802.11 devices and that the processing 
requirements of Maranello do not preclude performance 
improvement. Moreover, we evaluate Maranello’s per- 
formance compared to recently-proposed recovery pro- 
tocols using a trace-driven simulation. 
Abstract 


Traceroute is the most widely used Internet diagnos- 
tic tool today. Network operators use it to help identify 
routing failures, poor performance, and router misconfig- 
urations. Researchers use it to map the Internet, predict 
performance, geolocate routers, and classify the perfor- 
mance of ISPs. However, traceroute has a fundamental 
limitation that affects all these applications: it does not 
provide reverse path information. Although various pub- 
lic traceroute servers across the Internet provide some 
visibility, no general method exists for determining a re- 
verse path from an arbitrary destination. 

In this paper, we address this longstanding limitation 
by building a reverse traceroute system. Our system pro- 
vides the same information as traceroute, but for the re- 
verse path, and it works in the same case as traceroute, 
when the user may lack control of the destination. We 
use a variety of measurement techniques to incrementally 
piece together the path from the destination back to the 
source. We deploy our system on PlanetLab and compare 
reverse traceroute paths with traceroutes issued from the 
destinations. In the median case our tool finds 87% of 
the hops seen in a directly measured traceroute along the 
same path, versus only 38% if one simply assumes the 
path is symmetric, a common fallback given the lack of 
available tools. We then illustrate how we can use our 
reverse traceroute system to study previously unmeasur- 
able aspects of the Internet: we present a case study of 
how a content provider could use our tool to troubleshoot 
poor path performance, we uncover more than a thousand 
peer-to-peer AS links invisible to current topology map- 
ping efforts, and we measure the latency of individual 
backbone links with average error under a millisecond. 


1 Introduction 


Traceroute is a simple and widely used Internet diagnos- 
tic tool. It measures the sequence of routers from the 
source to the destination, supplemented by round-trip de- 
lays to each hop. Operators use it to investigate routing 
failures and performance problems [39]. Researchers use 
it as the basis for Internet maps [1, 22, 34], path predic- 
tion [22], geolocation [42, 14], ISP performance analy- 
sis [25], and anomaly detection [46, 19, 44, 43]. 
However, traceroute has a fundamental limitation — it 
provides no reverse path information, despite the fact that 
policy routing and traffic engineering mean that paths 
are generally asymmetric [15]. As Richard Steenbergen, 
CTO for nLayer Communications, put it at a recent tuto- 
rial for network operators on troubleshooting, “the num- 
ber one go-to tool is traceroute,” but “asymmetric paths 
[are] the number one plague of traceroute” because “the 
reverse path itself is completely invisible” [39]. 


This invisibility hinders operators. For instance, al- 
though Google has data centers distributed around the 
world, 20% of client prefixes experience unreasonably 
high latency, even with a nearby server. In working with 
a Google group trying to improve this performance, we 
found that we would have been able to more precisely 
troubleshoot problems if we could measure the path from 
clients back to Google [21]. 


Similarly, the lack of reverse path information restricts 
researchers. Traceroute’s inability to measure reverse 
paths forces unrealistic assumptions of symmetry on sys- 
tems with goals ranging from path prediction [22], geolo- 
cation [42, 14], and ISP performance analysis [25] to pre- 
fix hijack detection [46]. Recent work shows that mea- 
sured topologies miss many of the Internet’s peer-to-peer 
links [29, 16] because mapping projects [1, 22, 34] lack 
the ability to measure paths from arbitrary destinations. 

Faced with this shortcoming of the traceroute 
tool, operators and researchers turn to various limited 
workarounds. Surprisingly, network operators often re- 
sort to posting problems on operator mailing lists, ask- 
ing others to issue traceroutes to help with diagnosis [30, 41]. 
Public web-accessible traceroute servers hosted at vari- 
ous locations around the world provide some help, but 
their numbers are limited. Without a server in every net- 
work, one cannot know whether any of those available 
have a path similar to the one of interest. Further, they 
are not intended for the heavy load incurred by regu- 
lar monitoring. A few modern systems attempt to de- 
ploy traceroute clients on end-user systems around the 
world [34, 9], but none of them are close to allowing an 
arbitrary user to trigger an on-demand traceroute towards 
the user from anywhere in the world. 


Our goal is to address this basic restriction of tracer- 
oute by building a tool to provide the same basic infor- 
mation as traceroute (IP-address-level hops along the 
path, plus round-trip delay to each), but along the reverse 
path from the destination back to the source. We have 
implemented our reverse traceroute system and make 
it available at http://revtr.cs.washington.edu. While 
traceroute runs as a stand-alone program 
issuing probes on its own behalf, ours is a distributed 
system comprised of a few tens to hundreds of vantage 
points, owing to the difficulty in measuring reverse paths. 
As with traceroute, our reverse traceroute tool does not 
require control of the destination, and hence can be used 
with arbitrary targets. All our tool requires of the target 
destination is an ability to respond to probes, the same re- 
quirement as standard traceroute. It does not require new 
functionality from routers or other network components. 


Our system builds a reverse path incrementally, using 
a variety of methods to measure reverse hops, and stitch- 
ing them together into a path. We combine the view of 
multiple vantage points to gather information unavailable 
from any single one. We start by measuring the paths 
from the vantage points to the source. This limited atlas 
of a few hundred or thousand routes to the source serves 
to bootstrap the rest of our measurements, allowing us to 
measure the path from an arbitrary destination by build- 
ing back the path from the destination until it intersects 
the atlas. We use three main measurement techniques 
to build backwards. First, we rely on the fact that In- 
ternet routing is generally destination-based, allowing us 
to piece together the path one hop at a time. Second, 
we employ the IP timestamp and record route options to 
identify hops along the reverse path. Third, we use lim- 
ited source spoofing — spoofing from one vantage point 
as another — to use the vantage point in the best position 
to make the measurement. This controlled spoofing al- 
lows us to overcome many of the limitations inherent in 
using IP options [36, 35, 13], while remaining safe, as the 
spoofed source address is one of our hosts. Just as many 
projects use traceroute, others have used record route and 
spoofing for other purposes. Researchers used record 
route to identify aliases and generate accurate topolo- 
gies [35], and our earlier work used spoofing to char- 
acterize reachability problems [19]. In this work, we are 
the first to show that the combination of these techniques 
can be used to measure arbitrary reverse paths. 


Experienced users realize that, while traceroute is use- 
ful, it has numerous limitations and caveats, and can be 
potentially misleading [39]. Similarly, our tool has limi- 
tations and caveats. Section 5.1 includes a thorough dis- 
cussion of how the output of our tool might differ from 
a direct traceroute from the destination to the source, as 
well as how both might differ from the actual path tra- 
versed by traffic. Just as traceroute provides little visibil- 
ity when routers do not send TTL-expired messages, our 
technique relies on routers honoring IP options. When 
our measurement techniques fail to discern a hop along 
the path, we fall back on assuming the hop is traversed 
symmetrically; our evaluation results show that, in the 
median (mean) case for paths between PlanetLab sites, 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 


we measure 95% (87%) of hops without assuming sym- 
metry. The need to assume symmetry in cases of an unre- 
sponsive hop points to a limitation of our tool compared 
to traceroute; whereas traceroute can often measure past 
an unresponsive hop or towards an unreachable destina- 
tion, our tool must sometimes guess that it is measuring 
the proper path. 

We rely on routers to be “friendly” to our techniques, 
yet some of our techniques have the potential for abuse 
and can be tricky for novices to use without causing dis- 
turbances. As we ultimately want our tool widely used 
operationally, we have attempted to pursue our approach 
in a way that encourages continued router support. Our 
system performs measurements in a way that empha- 
sizes network friendliness, controlling probe rate across 
all measurements. We presented the work early on at 
NANOG [28] and RIPE [32] conferences, and so far the 
response from operators has been positive towards sup- 
porting our methods (including source spoofing). We be- 
lieve the goal of wide use is best served by a single, co- 
ordinated system that services requests from all users. 

We evaluate the effectiveness of our system as de- 
ployed today, though it should improve as we add van- 
tage points. We find that, in the median (mean) case 
for paths between PlanetLab sites, our technique reveals 
87% (83%) of the routers and 100% (94%) of the points- 
of-presence (PoPs), compared to a traceroute issued from 
the destination. Paths between public traceroute servers 
and PlanetLab show similar results. Because our tech- 
nique requires software at the source, our evaluation is 
limited to paths back to PlanetLab nodes we control. We 
believe our reverse traceroute system can be useful in a 
range of contexts, and we provide three illustrative exam- 
ples. We present a case study of how a content provider 
could use our tool to troubleshoot poor reverse path per- 
formance. We also uncover thousands of links at core In- 
ternet exchange points that are invisible to current topol- 
ogy mapping efforts. We use our reverse traceroute tool 
to measure link latencies in the Sprint backbone network 
with less than a millisecond of error, on average. 


2 Background 


In this section, we provide the reader some background 
on Internet routing and traceroute. 

Internet routing: First, a router generally determines 
the route on which to forward traffic based only on the 
destination. With a few caveats such as load-balancing 
and tunneling, the route from a given point towards a par- 
ticular destination is consistent for all traffic, regardless 
of its source. While certain tunnels may violate this as- 
sumption, best practices encourage tunnels that appear as 
atomic links. Second, asymmetry between forward and 
reverse paths stems from multiple causes. An AS is free 
to choose its next hop among the alternatives, whether or 
not that leads to a symmetric route. Two adjacent ASes 
may use different peering points in the two directions due 
to policies such as early-exit/hot-potato routing. Even 
within an individual AS, traffic engineering objectives 
may lead to different paths. 

Standard traceroute tool: Traceroute measures the se- 
quence of routers from a source to a destination, with- 
out requiring control of the destination. When traceroute 
was originally developed, most paths were symmetric, 
but that assumption no longer holds. Traceroute works 
by sending a series of packets to the destination, each 
time incrementing the time-to-live (TTL) from an ini- 
tial value of one, in order to get ICMP TTL exceeded 
responses from each router on the path in turn. Each 
response will have an IP address of an interface of the 
corresponding router. Additionally, traceroute measures 
the time from the sending of each packet to the receipt of 
the response, yielding a round-trip latency to each inter- 
mediate router. Because the destination resets the TTL in 
its reply, traceroute only works in the forward direction. 
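The TTL mechanism can be illustrated with a small simulation. This is only a sketch over an in-memory list of hops rather than real probe packets; the function name, the `silent` parameter, and the hop names are hypothetical.

```python
def simulated_traceroute(path, silent=frozenset(), max_ttl=30):
    """Simulate traceroute over a known forward path.

    path: router names in forward order, ending with the destination.
    silent: routers that never answer (reported as None, like "*").
    """
    hops = []
    for ttl in range(1, min(max_ttl, len(path)) + 1):
        # A probe sent with this TTL expires at the ttl-th node, which
        # returns an ICMP TTL-exceeded message (or the destination's
        # own reply once the probe gets that far).
        responder = path[ttl - 1]
        hops.append(None if responder in silent else responder)
    return hops
```

Nothing in this loop observes the route that the replies take back to the source, which is the limitation the rest of the paper addresses.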

The path returned by traceroute is a feasible, but pos- 
sibly inaccurate route. First, each hop comes from a 
response to a different probe packet, and the differ- 
ent probes may take different paths for reasons includ- 
ing contemporaneous routing changes or load balancing. 
The Paris traceroute customizes probe packets to pro- 
vide consistent results across flow-based load balancers, 
as well as to systematically explore the load-balancing 
options [2]. For all our traceroutes, we use the Paris op- 
tion that measures a single consistent path. Second, some 
routers on the path may not respond. For example, some 
routers may be configured to rate-limit responses or to 
not respond at all. Third, probe traffic may be treated 
differently than data traffic. 

Despite these caveats, traceroute has proved to be ex- 
tremely useful. Essential to traceroute’s utility is its uni- 
versality, in that it does not require anything of the desti- 
nation other than an ability to respond to probe packets. 


3 Reverse Traceroute 


We seek to build a reverse path tool equivalent to tracer- 
oute. Like traceroute, ours should work universally with- 
out requiring control of a destination, and it should use 
only features available in the Internet as it exists today. 
The reverse traceroute tool should return IP addresses of 
routers along the reverse path from a destination back 
to the source, as well as the round-trip delay from those 
routers to the source. 

At a high level, the source requests a path from our 
system, which coordinates probes from the source and 
from a set of distributed vantage points to discover 
the path. First, distributed vantage points issue tracer- 
outes to the source, yielding an atlas of paths towards 
it (Fig. 1(a)). This atlas provides a rich, but limited in 
scope, view of how parts of the Internet route towards 
the source. We use this limited view to bootstrap mea- 
surement of the desired path. Because Internet routing is 
generally destination-based, we assume that the path to 
the source from any hop in the atlas is fixed (over short 
time periods) regardless of how any particular packet 
reaches that hop; once the path from the destination to 
the source reaches a hop in the atlas, we use the atlas 
to derive the remainder of the path. Second, using tech- 
niques we explain in Sections 3.1 and 3.2, we measure 
the path back from the destination incrementally until it 
intersects this atlas (Fig. 1(b)). Finally, as shown in an 
example in Section 3.3, we merge the two components 
of the path, the destination-specific part measured from 
the destination until it intersects the atlas, and the atlas- 
derived path from this intersection back to the source, to 
yield a complete path (Fig. 1(c)). 
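The three steps above can be sketched as follows. This is a simplified illustration, not the deployed system: `measure_next_hops` is a hypothetical callback standing in for the measurement techniques of Sections 3.1 and 3.2, and the atlas is a plain dictionary built from traceroutes that end at the source.

```python
def build_atlas(traceroutes_to_source):
    """Map each hop on a vantage-point traceroute to the rest of its path to S."""
    atlas = {}
    for path in traceroutes_to_source:  # each path ends at the source
        for i, hop in enumerate(path):
            # Destination-based routing: keep the first suffix seen for a hop.
            atlas.setdefault(hop, path[i + 1:])
    return atlas


def reverse_traceroute(destination, atlas, measure_next_hops, max_steps=32):
    """Build the path from destination back to the source, or None on failure.

    measure_next_hops(hop) returns the reverse hops measured from `hop`
    towards the source (possibly an empty list).
    """
    path = [destination]
    for _ in range(max_steps):
        if path[-1] in atlas:
            return path + atlas[path[-1]]
        hops = measure_next_hops(path[-1])
        if not hops:
            return None  # the full system would fall back to assuming symmetry
        for hop in hops:
            path.append(hop)
            if hop in atlas:
                # Intersection with the atlas: the remainder is already known.
                return path + atlas[hop]
    return None
```

For example, if the atlas contains the traceroute `["v1", "r3", "r4", "S"]` and the measurements from D yield `r1`, then `r2` and `r3`, the merged result is `["D", "r1", "r2", "r3", "r4", "S"]`.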


3.1 Identify Reverse Hops with IP Options 


We use two basic measurement primitives, the Record 
Route and Timestamp IP options. While TTL values are 
reset by the destination, restricting traceroute to measur- 
ing only on the forward path, IP options are generally re- 
flected in the reply from a destination, so routers along 
both the forward and reverse path process them. We 
briefly explain how the options work: 
IP Record-route option (RR): With this option set, a 
probe records the router interfaces it encounters. The IP 
standard limits the number of recorded interfaces to 9; 
once those fill, no more interfaces are recorded. 
IP timestamp option (TS): IP allows probes to query a 
set of specific routers for timestamps. Each probe can 
specify up to four IP addresses, in order; if the probe 
traverses the router matching the next IP address that has 
yet to be stamped, the router records a timestamp. The 
addresses are ordered, so a router will not timestamp if 
its IP address is in the list but is not the next one. 
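The slot limit of the record-route option can be illustrated with a small simulation. This sketch models hops as names in a list rather than real packets, and assumes the destination records itself when a slot remains; the function names are ours.

```python
RR_SLOTS = 9  # the IP standard limits record route to 9 interfaces


def record_route_ping(forward_hops, destination, reverse_hops, slots=RR_SLOTS):
    """Return the interfaces recorded by a record-route ping and its reply."""
    # Forward routers, then the destination, then reverse routers record
    # themselves in order until the slots run out.
    return (forward_hops + [destination] + reverse_hops)[:slots]


def reverse_hops_seen(recorded, destination):
    """Reverse-path hops revealed by the recorded list (possibly none)."""
    if destination not in recorded:
        return []
    return recorded[recorded.index(destination) + 1:]
```

With seven forward hops, the destination takes the eighth slot and a single reverse hop is revealed; with nine or more forward hops, nothing about the reverse path is learned.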

We use these options to gather reverse hops as fol- 
lows: 


• RR-Ping(S → D): As shown in Figure 2(a), the 
source S issues an ICMP Echo Request (henceforth 
ping) probe to D with the RR option enabled. If RR 
slots remain when the destination sends its response, 
then routers on the reverse path will record some of 
that route. This allows a limited measurement of the 
reverse path, as long as the destination is fewer than 
9 hops from the source. 

• TS-Query-Ping(S → D | D,R): As shown in Figure
2(b), the source S issues an ICMP ping probe
to D with the timestamp query enabled for the ordered
pair of IP addresses D and R. R will record
its timestamp only if it is encountered by the probe
after D has stamped the packet. In other words, if S
receives a timestamp for R, then it knows R appears
on the reverse path. For our purposes, the value of
the timestamp is meaningless; we just care whether
or not a particular router processes the packet. Thus,
if we guess a router on the return path, the TS option
can confirm our hypothesis.


Figure 1: High-level overview of the reverse traceroute technique. (a) Vantage points traceroute to S, creating an atlas of known paths. (b) Vantage points measure path from D until it intersects a path in the atlas. (c) Combine to yield complete path. We explain how to measure from D back to the atlas in § 3.1-3.2.

Figure 2: Three measurement techniques that allow us to establish that R is on the reverse path from D back to S. (a) S sends a record-route ping. The header includes slots for 9 IP addresses to be recorded (1). If the packet reaches D with slots remaining, D adds itself (2), and routers on the reverse path fill the remaining slots (3). (b) S sends a timestamp ping, asking first for D to provide a stamp if encountered, then for R to provide one (1). If D supports the timestamp option, it fills out a timestamp (2). Because the timestamp requests are ordered, R only fills out a timestamp if encountered after D, necessarily on the reverse path (3). (c) Vantage point V sends a record-route ping to D, spoofing as S (1). D replies to S (2), allowing S to discover that R is on the reverse path (3). We use this technique when S is out of record-route range of D, but V is close enough. In §4, we give two techniques, akin to (c), that use spoofing to overcome limitations in timestamp support.


We use existing network topology information — 
specifically IP-level connectivity of routers from In- 
ternet mapping efforts [22] — to determine candidate 
sets of routers for the reverse path. Routers adja- 
cent to D in the topology are potential next hops; we 
use timestamp query probes to check whether any of 
these potential next hops is on the path from D to S. 
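
To make the two probe types concrete, the following sketch builds the raw IP option bytes that RR-Ping and TS-Query-Ping rely on, using the Record Route and Internet Timestamp layouts from RFC 791. The function names are illustrative assumptions, not part of the reverse traceroute system, and the addresses are placeholders; padding the option area to a 4-byte boundary and attaching it to an ICMP Echo Request is left to the caller.

```python
import socket

IPOPT_RR = 7              # Record Route option type (RFC 791)
IPOPT_TS = 68             # Internet Timestamp option type (RFC 791)
TS_FLAG_PRESPECIFIED = 3  # only the listed addresses may stamp, in order


def record_route_option(slots: int = 9) -> bytes:
    """Record Route option with `slots` empty 4-byte address slots."""
    if not 1 <= slots <= 9:
        raise ValueError("the 40-byte option area fits at most 9 RR slots")
    length = 3 + 4 * slots   # type + length + pointer + address slots
    pointer = 4              # 1-based offset of the first free slot
    return bytes([IPOPT_RR, length, pointer]) + bytes(4 * slots)


def timestamp_query_option(addresses: list[str]) -> bytes:
    """Timestamp option asking the given addresses to stamp, in order."""
    if not 1 <= len(addresses) <= 4:
        raise ValueError("at most 4 prespecified addresses fit in the option")
    length = 4 + 8 * len(addresses)  # 4-byte header + (address, stamp) pairs
    pointer = 5                      # 1-based offset of the first pair
    entries = b"".join(socket.inet_aton(a) + bytes(4) for a in addresses)
    return bytes([IPOPT_TS, length, pointer, TS_FLAG_PRESPECIFIED]) + entries


# Hypothetical addresses for D and R, in the order they must stamp.
option = timestamp_query_option(["192.0.2.1", "192.0.2.2"])
print(option.hex())
```

Because the prespecified addresses are processed in order, listing D before R is what restricts R to stamping only after the packet has reached D.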


Note that there are some caveats to using the probes 
outlined above. One is that from each vantage point, 
only a fraction of routers will be reachable within record 
route’s limit of 9 hops. Another is that some ISPs filter
and drop probes with IP options set. Further, some
routers do not process the IP options in the prescribed
manner. Fortunately, we can overcome these limitations
in the common case by carefully orchestrating the measurements
from a diverse set of vantage points.


NSDI '10: 7th USENIX Symposium on Networked Systems Design and Implementation


3.2 Spoof to Best Use Record Route 


A source-spoofed probe (henceforth referred to as a 
spoofed probe) is one in which the prober sets the source 
address in the packet to one other than its own. We use 
a limited form of spoofing, where we replace the source 
address in a probe with the “true” source of the reverse 
traceroute. This form of spoofing is an extremely powerful
measurement tool. When V probes D spoofing as S,
D’s response will go to S; we refer to V as the spoofer
and S as the receiver. This method allows the probe to
traverse the path from D to S without having to traverse
the path from S to D and without having a vantage point
in D’s prefix. We could hypothetically achieve a similar
probe trajectory using loose source routing (from V
to S, but source routed through D) [31]. However, a
source-routed packet can be identified and filtered any- 
where along the path, and such packets are widely fil- 
tered [3], too often to be useful in our application. On the 
other hand, if a spoofed packet is not ingress filtered near 
the spoofer, it thereafter appears as a normal packet; we 
can use a source capable of spoofing to probe along any 
path. Based on our measurements to all routable prefixes, 
many routers that filter packets with the source route op- 
tion do not filter packets with the timestamp or record 
route options. This difference is likely because source 
routing can be used to violate routing policy, whereas 
the timestamp and record route options cannot. 


This arrangement allows us to use the most advanta- 
geous vantage point with respect to the particular mea- 
surement we want to perform. Our earlier system Hub- 
ble used limited spoofing to check one-way reachabil- 
ity [19]; we use it here to overcome limitations of IP 
options. Without spoofing, RR’s 9 IP address limit re- 
stricts it to being useful only when S is near the target. 
However, as shown in Figure 2(c), if some vantage point
V is close enough to reach the target within 8 RR slots, 
then we can probe from V spoofing as S to receive IP 
addresses on the path back to S. Similarly, spoofing can 
bypass problematic ASes and machines, such as those 
that filter timestamp-enabled packets or those that do not 
correctly implement the option. 
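
The arithmetic behind the 9-slot limit can be illustrated with a small simulation. This is a simplified model under a stated assumption: every router on the forward path, and the destination itself, records exactly one address (real routers do not always behave this way). The function and router names are hypothetical.

```python
RR_SLOTS = 9  # addresses a record-route header can hold


def reverse_hops_recorded(forward_routers, reverse_routers, slots=RR_SLOTS):
    """Return the reverse-path routers that still fit in the RR header.

    Assumes each forward router and the destination consume one slot each.
    """
    used = len(forward_routers) + 1   # forward routers plus D itself
    free = max(0, slots - used)       # slots left when the reply leaves D
    return list(reverse_routers[:free])


reverse_path = ["R1", "R2", "R3", "R4"]   # hops from D back toward S

# S is 10 routers away from D: the slots fill before the probe reaches D.
print(reverse_hops_recorded([f"F{i}" for i in range(10)], reverse_path))

# A spoofing vantage point V is 7 routers away: D takes slot 8, R1 slot 9.
print(reverse_hops_recorded([f"G{i}" for i in range(7)], reverse_path))
```

In this model a probe learns at least one reverse hop only when D is reached within 8 slots, which is why the choice of a nearby spoofer matters.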


Although spoofing is often associated with malicious 
intent, we use it in a very controlled, safe fashion. A 
node requests a reverse path measurement, then receives 
responses to probes sent by vantage points spoofing as it. 
No harm can come from causing one of our own nodes 
to receive measurement packets. This form of spoofing 
shares a purpose with the address rewriting done by mid- 
dleboxes such as NATs, controlling the flow of traffic to a 
cooperative machine, rather than with malicious spoofing 
which seeks concealment. Since some ISPs filter spoofed 
packets, we test from each host and only send further 
spoofed probes where allowed. We have been issuing 
spoofed probes for over two years without complaint. 


Roughly 20% of PlanetLab sites allow spoofing; this 
ability is not limited to PlanetLab: the Spoofer project 
tested 12,000 clients and found that 31% could send 
spoofed packets [5]. Even if filtering increases, we believe,
based on positive feedback from operators, that the value 
of our service will encourage an allowance (supported by 
router ACLs) for a small number of measurement nodes 
to issue spoofed probes using a restricted set of ports. An 
even simpler approach is to have routers rate limit these 
spoofed options packets (just as with UDP probes) and 
filter spoofed probes sent to broadcast addresses, thereby 
reducing the security concerns without diminishing their 
utility for network measurements. 


3.3 Incrementally Build Paths


IP option-enabled probes, coupled with spoofing as S
from another vantage point, give us the ability to measure 
a reverse hop from D on the path back to S. We can use 
the same techniques to stitch together a path incremen- 
tally — once we know the path from D goes through R, 
we need only determine the route at R towards S when 
attempting to discover the next hop. Because Internet 
routing is generally based on the destination, each inter- 
mediate router R we find on the path can become the new 
destination for a reverse traceroute back to the source. 
Further, if R is on a path from some vantage point V 
to S, then we can infer the rest of D’s return path from 
R onward as being the same as V’s. This assumption 
holds even in cases of packet-, flow-, and destination-based
load balancing, so long as R balances traffic independently
of other routers and of the source.

Figure 3 illustrates how we can compose the above set 
of techniques to determine the reverse path from D to S, 
when we have control over S and a set of other vantage 
points (V1, V2, V3). We assume that we have a partial
map of router-level connectivity, e.g., from a traditional 
offline mapping effort. 

We begin by having the vantage points issue traceroute
probes to S (Figures 1(a) and 3(a)). These serve as
a baseline set of observed routes towards S that can be
used to complete a partially inferred reverse path. We
then issue RR-Ping(S → D) to determine if the source
S is within 8 RR hops of the destination, i.e., whether
a ping probe from S can reach D without filling up its
entire quota of 9 RR hops (Figure 3(b))¹. If the source
is within 8 RR hops of D, this probe would determine at 
least the first hop on the reverse path, with further hops 
recovered in an iterative manner. 

If the source is not within 8 hops, we determine 
whether some vantage point is within 8 RR hops of D 
(Section 4.5 describes how we do this). Let V3 be one 
such vantage point. We then issue a spoofed RR ping 
probe from V3 to D with the source address set to S (Figure
3(c)). This probe traverses the path V3 → D → S
and records IP addresses encountered. The probe reveals
R1 to be on the reverse path from D to S. We then iterate
over this process, with the newly found reverse hop
as the target of our probes. For instance, we next determine
a vantage point that is within 8 RR hops of R1,
which could be a different vantage point V2. We use this 
new vantage point to issue a spoofed RR ping probe to 
determine the next hop on the reverse path (Figure 3(d)). 


¹This is not quite as simple as sending a TTL=8 limited probe, because
of issues with record route implementations [35]. Some routers
on the forward path might not record their addresses, thereby freeing up
more slots for the reverse path, while some other routers might record
multiple addresses or might record their address but not decrement or
respond to TTL-limited probes.


Figure 3: Illustration of the incremental construction of a reverse path using diverse information sources. (a) Vantage points traceroute to S, creating an atlas of known paths. (b) S sends an RR-Ping to D (1), but all the RR slots fill along the forward path (2), so S does not learn any reverse hops in the reply. (c) A vantage point V3 that is closer to D sends an RR-Ping spoofing as S (1). D records its address (2), and the remaining slot fills on the reverse path, revealing R1 (3). (d) A vantage point V2 close to R1 sends an RR-Ping spoofing as S (1), discovering R2 and R3 on the reverse path (2). (e) We use an Internet map to find routers adjacent to R3, then send each a TS-Query-Ping to verify which is on the reverse path. (f) When we intersect a known path in our traceroute atlas, we assume the rest of the path from D follows that route.


In some cases, a single RR ping probe may determine
multiple hops, as in the illustration with R2 and R3.
Now, consider the case where neither S nor any of
the vantage points is within 8 hops of R3. In that case,
we consider the potential next hops to be routers adjacent
to R3 in the known topology. We issue timestamp
probes to verify whether the next hop candidates R4 and
R5 respond to timestamp queries TS-Query-Ping(S → D|D,R4)
and TS-Query-Ping(S → D|D,R5) (as
shown in Figure 3(e)). When R4 responds, we know that
it is adjacent to R3 in the network topology and is on the
reverse path from R3, and so we assume it is the next hop
on the reverse path. We continue to perform incremental
reverse hop discovery until we intersect with a known
path from a vantage point to S², at which point we consider
that to be the rest of the reverse path (Figures 1(c)
and 3(f)). Once the procedure has determined the hops in 
the reverse path, we issue pings from the source to each 
hop in order to determine the round-trip latencies. 

²Measurement techniques may discover different addresses on a
router [36], so we determine intersections using alias data from topology
mapping projects [20, 22, 35] and a state-of-the-art technique [4].

Sometimes, we may be unable to measure a reverse
hop using any of our techniques, but we still want to
provide the user with useful information. When reverse
traceroute is unable to calculate the next hop in a path,
the source issues a standard traceroute to the last known
hop on the path. We then assume the last link is traversed
symmetrically, and we try to calculate the rest of the reverse
path from there. In Section 5.1 and Section 5.2,
we present results that show that we usually do not have
to assume many symmetric hops and that, even with this
approximation, we still achieve highly accurate paths.
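
The incremental procedure of this section can be summarized in a short sketch. It is a simplified illustration rather than the deployed implementation: the three probe callables (`rr_probe`, `ts_probe`, `assume_symmetric`) are hypothetical stand-ins for the spoofed record-route measurement, the timestamp check of adjacent routers, and the symmetric-link fallback, and each is assumed to return a (possibly empty) list of newly discovered reverse hops. The sketch also checks for an atlas intersection only at the most recently discovered hop.

```python
def build_atlas(traceroutes):
    """Map each hop on a known path to the hops that follow it toward S."""
    atlas = {}
    for path in traceroutes:
        for i, hop in enumerate(path):
            atlas.setdefault(hop, path[i + 1:])
    return atlas


def reverse_traceroute(dst, src, atlas, rr_probe, ts_probe, assume_symmetric,
                       max_hops=30):
    """Build the path from dst back to src one measured segment at a time."""
    path = [dst]
    while path[-1] != src and len(path) < max_hops:
        current = path[-1]
        if current in atlas:                 # intersected a known path to S
            path.extend(atlas[current])
            break
        # Try record route first, then timestamp, then assume symmetry.
        hops = rr_probe(current) or ts_probe(current) or assume_symmetric(current)
        if not hops:                         # nothing worked: return partial path
            break
        path.extend(hops)
    return path


# Toy scenario loosely modeled on Figure 3 (X is a made-up atlas hop).
atlas = build_atlas([["V1", "R4", "X", "S"]])
rr_results = {"D": ["R1"], "R1": ["R2", "R3"]}
ts_results = {"R3": ["R4"]}
print(reverse_traceroute(
    "D", "S", atlas,
    rr_probe=lambda hop: rr_results.get(hop, []),
    ts_probe=lambda hop: ts_results.get(hop, []),
    assume_symmetric=lambda hop: [],
))
# ['D', 'R1', 'R2', 'R3', 'R4', 'X', 'S']
```

Treating each newly found hop as the next target relies on the destination-based routing assumption stated above.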


4 System Implementation 


Section 3 describes how our techniques in theory would 
allow us to measure a reverse path. In this section, we 
discuss how we had to vary from that ideal description 
in response to realities of available vantage points and of 
router implementations. In addition, the following goals 
drive our system design: 


• Accuracy: It should be robust to variations in how
options-enabled packets are handled by routers.


• Coverage: The system should work for arbitrary destinations
irrespective of ISP-specific configurations.


• Scalability: It needs to be selective with the use of
vantage points and introduce as little measurement
traffic as possible.


Figure 4: Example of how we discover a reverse hop with timestamp even though D does not stamp, as long as it replies. (a) S sends a timestamp ping to D (1) and receives a reply, but D has not filled out a timestamp (2). (b) We find a V s.t., when V pings D asking for R’s timestamp (3), it does not receive a stamp (4). This response indicates that R is not on V’s path to D. (c) V spoofs as S, pinging D (5), and S receives a timestamp for R (6). Because we established that R is not on V’s path to D, R must be on the reverse path from D to S.


4.1 Architecture 


Our system consists of vantage points (VPs), which is- 
sue measurements, a controller, which coordinates the 
VPs to measure reverse paths, and sources, which re- 
quest paths to them. We use a local machine at UW 
as a controller. When a VP starts up, it registers with 
the controller, which can then send it probe requests. A 
source runs our software to issue standard traceroutes, 
RR-Pings, and TS-Query-Pings, and to receive responses 
to probes from VPs spoofing as the source. However, 
it need not spoof packets itself. Currently, our source 
software only runs on Linux and (like the ICMP traceroute
option traceroute -I) requires root permission.
The controller receives requests from sources and
combines the measurements from VPs to report reverse 
path information. While measuring a reverse traceroute, 
the controller queues up all incoming requests. When the 
ongoing measurement completes, the controller serves 
all requests in the queue as a batch, in synchronized 
rounds of probes. This design lets us carefully control
the rate at which we probe any particular router and
reuse measurements when a particular source requests
multiple destinations.
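
A minimal sketch of this queue-then-batch behavior follows. The class and method names are hypothetical, and grouping requests by source is our assumption about how measurements such as the traceroute atlas toward a source could be shared across that source's destinations.

```python
from collections import deque


class Controller:
    """Queues requests that arrive while a measurement round is in progress."""

    def __init__(self):
        self._pending = deque()

    def submit(self, source, destination):
        """Record a request for the reverse path from destination to source."""
        self._pending.append((source, destination))

    def next_batch(self):
        """Drain the queue and group the requested destinations by source."""
        batch = {}
        while self._pending:
            source, destination = self._pending.popleft()
            dests = batch.setdefault(source, [])
            if destination not in dests:   # drop duplicate requests
                dests.append(destination)
        return batch


controller = Controller()
controller.submit("S1", "D1")
controller.submit("S2", "D3")
controller.submit("S1", "D2")
print(controller.next_batch())   # {'S1': ['D1', 'D2'], 'S2': ['D3']}
```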

We use topology maps from iPlane [22]³ to identify
adjacent routers to test with TS-Query-Pings (Fig. 3(e)). 
To increase the set of possible next-hop routers, we con- 
sider the topology to be the union of maps from the pre- 
vious 2 weeks. Since we verify the reverse hops using 
option-enabled pings, stale topology information makes 
our system less efficient but does not introduce error. 


4.2 Current Deployment 


Our current deployment uses one host at each of the more 
than 200 active PlanetLab sites as VPs to build an atlas 
of traceroutes to a source (Fig. 3(a)). Over the course of 
our study, 60+ PlanetLab sites allowed spoofed probes 
at least some of the time. We employ one host at each
of these sites as spoofing nodes. Routers upstream from
the other PlanetLab sites filter spoofed probes, so we
did not spoof from them. We also use one host at each
of 14 Measurement Lab sites [26], most of which allow
spoofing. Various organizations provide public web-accessible
traceroute servers, and we employ 1200 of
them [3]. These nodes issue only traceroutes and cannot
set IP options, spoof, or receive spoofed probes. We use
them to expand the sets of known paths to our sources.


³iPlane issues forward traceroutes from PlanetLab sites and traceroute
servers to around 140K prefixes.


We have currently tested our client software only 
on PlanetLab (Linux) machines. We make it available
as a demo; our website http://revtr.cs.washington.edu
allows users to enter an IP address
and measure the reverse path back from it to a PlanetLab 
node. Packaging the code for widespread deployment is 
future work. Because we have dozens of operators ask- 
ing to use the system, we are being patient to avoid a 
launch that does not perform up to expectations. 


4.3 Correcting for Variations in IP Options Support 


We next explain how we compensate for variations in 
support for the timestamp option. When checking if R 
is on the reverse path from D, we normally ask for both 
D’s and R’s timestamp, to force R to only stamp on the 
reverse path. However, we found that, of the addresses in 
a day’s iPlane atlas that respond to ping, 16.6%
respond to timestamp-enabled pings but do not
stamp, so we cannot use that technique to know that R
stamped on the reverse path. Figure 4 illustrates how
we use spoofing to address this behavior. Essentially, we
find a VP V which we can establish does not have R on
its path to D, then V pings D spoofing as S, asking for
R’s timestamp (but not D’s). If S receives a stamp for
R, it proves R is on the reverse path from D. This technique
will not work if all vantage points have R on their
paths. We examined iPlane traceroutes to destinations in 
140,000 prefixes and found at least two adjacent hops for 
55% of destinations.

Figure 5: If a timestamp probe from S encounters a filter (1),
we can often bypass it by spoofing as S from a different vantage
point (2), as long as the filter is just on the forward path.
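
The check illustrated in Figure 4, for destinations that reply but do not stamp, can be written as a short decision procedure. This is a sketch under assumptions: the two callables are hypothetical wrappers around real probes, and a negative result only means that R was not confirmed, not that it is absent from the reverse path.

```python
def confirm_reverse_hop(vantage_points, direct_probe_stamped, spoofed_probe_stamped):
    """Decide whether R is on the reverse path from D to S.

    direct_probe_stamped(vp):  True if R stamps when vp pings D as itself.
    spoofed_probe_stamped(vp): True if S receives R's stamp when vp pings D
                               while spoofing as S.
    Returns True if R is confirmed, False if a usable vantage point saw no
    stamp, and None if every vantage point may have R on its own path to D.
    """
    for vp in vantage_points:
        if direct_probe_stamped(vp):
            continue   # R may lie on vp's path to D, so a stamp is ambiguous
        return spoofed_probe_stamped(vp)
    return None


# V1 sees R on its own round trip to D, so only V2 is usable.
direct = {"V1": True, "V2": False}
spoofed = {"V2": True}
print(confirm_reverse_hop(["V1", "V2"], direct.get, spoofed.get))   # True
```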


4.4 Avoiding Probe Filters to Improve Coverage 


We next discuss techniques to improve the coverage of 
our measurements. Some networks may filter ICMP 
packets, and others filter packets with options enabled. 
In the course of measuring a reverse path, if a source at- 
tempts a TS or RR measurement and does not receive a 
response, we retry the measurement with a VP spoofing 
as the source. As seen in Figure 5, if filtering occurs only 
along the source’s forward path, and the VP’s forward 
path does not have a filter, the original source should re- 
ceive the response. 

We demonstrate the effectiveness of this approach on 
a small sample of 1000 IP addresses selected at random 
out of those in the iPlane topology known to respond 
to timestamp probes. The 1000 destinations include ad- 
dresses in 662 ASes. We chose 10 spoofing PlanetLab 
vantage points we found to receive (non-spoofed) times- 
tamp responses from the highest number of destinations, 
plus one host at each of the 209 working non-spoofing 
PlanetLab sites. First, each non-spoofing node sent a series
of timestamp pings to each destination; redundant
probes account for loss due to something other than per- 
manent filters. Of the 209 hosts, 103 received responses 
from at least 700 destinations; we dropped them from 
the experiment, as they do not experience significant 
filtering. Then, each spoofing vantage point sent 106 
timestamp pings to each destination, spoofing as each 
of the remaining PlanetLab hosts in turn. Of these, 63 
failed to receive any responses to either spoofed or non- 
spoofed probes; they are completely stuck behind filters 
or were not working. For the remaining 43 hosts, Fig- 
ure 6 shows how many destinations each host receives 
responses from, both without and with spoofing. Our re- 
sults show that some sites benefit significantly. In re- 
verse traceroute’s timestamp measurements, whenever 
the source does not receive a response, we retry with 
5 spoofers. Since some vantage points have filter-free 
paths to most destinations, we use the 5 best overall, 
rather than choosing per destination. For the nodes that 
experience widespread filtering, spoofing enables a sig- 
nificant portion to still use timestamps as part of reverse 
traceroute. As we show in Section 5.2, our timestamp 
techniques help the overall coverage of the tool.

Figure 6: For 43 PlanetLab nodes, the number of destinations
(out of 1000) from which the node receives timestamp
responses. The graph shows the total number of unique desti- 
nations when sending the ping directly and then when also us- 
ing 10 spoofers. The nodes are ordered by the total number of 
responding destinations. Other PlanetLab sites were tested but 
are not included in the graph: 103 did not experience significant 
filtering and 63 did not receive responses even with spoofing. 
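
The retry policy described in this section (pick the 5 spoofers with the best overall reach, then retry through them when the source hears nothing) might be sketched as follows. The helper names and the sequential retry order are assumptions made for illustration; the probe callables are hypothetical and return None when no response arrives.

```python
def best_spoofers(responses, k=5):
    """Rank spoofing VPs by how many destinations answered their probes.

    responses: dict mapping a VP name to the set of responding destinations.
    Ties are broken by name so the ranking is deterministic.
    """
    return sorted(responses, key=lambda vp: (-len(responses[vp]), vp))[:k]


def timestamp_with_retry(direct_probe, spoofed_probe, spoofers):
    """Probe from the source; on silence, retry via each spoofer in turn."""
    reply = direct_probe()
    if reply is not None:
        return reply
    for vp in spoofers:
        reply = spoofed_probe(vp)   # vp sends the probe spoofing as the source
        if reply is not None:
            return reply
    return None


reach = {"A": {"d1", "d2", "d3"}, "B": {"d1"}, "C": {"d1", "d2"}}
print(best_spoofers(reach, k=2))   # ['A', 'C']
```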


4.5 Selective Use of Vantage Points for Scalability 


With spoofed RR, only nearby spoofers can find reverse 
hops, since each packet includes only 9 slots. Because 
many routers rate limit after only a few probes, we can- 
not send from many vantage points at once, in the hopes 
that one will prove close enough — the router might drop 
the probes from the VPs within range. Our goal is to de- 
termine which VPs are likely to be near a router before 
we probe it. Because Internet routing is generally based 
on the destination’s prefix, a VP close to one address in 
a prefix is likely close to other addresses in the prefix. 

Each day, we harvest the set of router IP addresses 
seen in the Internet atlas gathered by iPlane on the pre- 
vious day and supplement the set with a recent list of 
pingable addresses [17]. Each day, every VP issues a 
record route ping to every address in the set. For each ad- 
dress, we determine the set of VPs that were near enough 
to discover reverse hops. We use this information in two 
ways during a reverse traceroute. First, if we encounter 
one of the probed addresses, we know the nearest VP to 
use. Second, if we encounter a new address, the offline 
probes provide a hint: the group of VPs within range of 
some address in the same prefix. Selecting the minimal
number of vantage points to use from this group is an instance
of the well-known set cover optimization problem.
We use the standard greedy algorithm to decide which 
VPs to use for a prefix, ordered by the number of addi- 
tional addresses they cover within the prefix. 
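
As an illustration, the standard greedy set cover heuristic applied to one prefix could look like the sketch below. The input format (a mapping from each VP to the set of addresses in the prefix within its record-route range) and the alphabetical tie-breaking are assumptions made for the example.

```python
def greedy_vp_order(in_range):
    """Order VPs for a prefix by the additional addresses each one covers.

    in_range: dict mapping a VP name to the set of addresses in the prefix
    that are within record-route range of that VP.
    """
    uncovered = set().union(*in_range.values())
    remaining = dict(in_range)
    order = []
    while uncovered and remaining:
        # Pick the VP covering the most still-uncovered addresses.
        vp = max(sorted(remaining), key=lambda v: len(remaining[v] & uncovered))
        gain = remaining.pop(vp) & uncovered
        if not gain:
            break
        order.append(vp)
        uncovered -= gain
    return order


coverage = {"V1": {"a", "b", "c"}, "V2": {"c", "d"}, "V3": {"d"}}
print(greedy_vp_order(coverage))   # ['V1', 'V2']
```

The greedy choice is not guaranteed to be minimal, but it yields the ordering by marginal coverage that the text describes.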

For a representative day, Figure 7 shows the coverage 
we achieve at given numbers of VPs per prefix. Our sys- 
tem determines the covering VPs for all prefixes, but the 
graph only includes prefixes for which we probed at least 
15 addresses, as it is trivial to cover small prefixes. We 
see that, for most prefixes, we only need a small number 
of VPs. For example, in the median case, a single VP 
suffices for over 95% of addresses in the prefix, and we 
rarely need more than 4 VPs to cover the entire prefix.

Figure 7: For prefixes in which iPlane observed ≥ 15 addresses,
the fraction of the addresses for which we can find
reverse hops using RR probes from a given number of vantage 
points per prefix. Note that we only include addresses within 
range of at least one vantage point. Prefixes with few addresses 
are trivial to cover using a small number of vantage points, so 
the graph excludes them to clearly show that we still only need 
a small number for most prefixes. 


5 Evaluation 


To test how well our reverse traceroute system can deter- 
mine reverse paths, we consider evaluation settings that 
allow us to compare a reverse traceroute from D to S to 
a direct traceroute from D to S. A complete evaluation 
of the accuracy of our technique would require ground 
truth information about the path back from the destina- 
tion. Obviously, we lack ground truth for the Internet, but 
we use two datasets, one PlanetLab-based and one using 
public traceroute servers, in which we can compare to a 
traceroute from D. For the reverse traceroute, we assume 
we do not control D and must measure the path using the 
techniques described in this paper. For the direct tracer- 
oute, we do control D and can simply issue a standard 
traceroute from D to S. 


In the PlanetLab set, we employ as sources a host at 
each of 11 PlanetLab sites chosen at random from the 
spoofing nodes. As destinations, we use one host at each 
of the 200 non-spoofing PlanetLab sites that were working.
Although such a set is not representative of the entire
Internet, the destinations include hosts in 35 countries.
The measured reverse paths traversed 13 of the 14
transit-free commercial ISPs. Previous work observed 
route load balancing in many such networks [2], provid- 
ing a good test for our techniques. 


In the traceroute server set, we employ as sources a 
host at 10 of the same PlanetLab sites (one had gone 
down in the meantime). The 1200 traceroute servers 
we utilize belong to 186 different networks (many of 
which offer multiple traceroute servers with different lo- 
cations). For each source, we choose a traceroute server 
at random from each of the 186 networks. We then is- 
sue a traceroute from the server to the PlanetLab source. 
Because in many cases we do not know the IP address 
of the traceroute server, we use the first hop along its 
path as the destination in our reverse traceroute measurements.
When measuring a reverse traceroute from this
destination back to the source, we exclude from our sys- 
tem all traceroute servers in the same network, to avoid 
providing our system with such similar paths as to make 
its task trivial. 


5.1 Accuracy 


How similar are the hops on a reverse traceroute to 
a direct traceroute from the destination back to the 
source? For the PlanetLab dataset, the RevTR line in
Figure 8 depicts the fraction of hops seen on the direct 
traceroute that are also seen by reverse traceroute. Fig- 
ure 9 shows the same for the traceroute server dataset. 
Note that, outside of this experimental setting, we would 
not normally have access to the direct traceroute from 
the destination. Using alias data from topology map- 
ping projects [20, 22, 35] and aliases we discover us- 
ing a state-of-the-art technique [4], we consider a tracer- 
oute hop and a reverse traceroute hop to be the same 
if they are aliases for the same router. We need to use 
alias information because the techniques may find dif- 
ferent IP addresses on the same router [36]. For ex- 
ample, traceroute generally finds the ingress interface, 
whereas record route often returns the egress or loopback 
address. However, alias resolution is an active and chal- 
lenging research area, and faulty aliases in the data we 
employ could lead us to falsely label two hops as equiva- 
lent. Conversely, missing aliases could cause us to label 
as different two interfaces on the same router. Because 
the alias sets we use are based on measurements from 
PlanetLab or similar vantage points, we likely have more 
complete alias data for our PlanetLab dataset than for our 
traceroute server dataset. 

Using the available alias data, we find that the paths 
measured by our technique are quite similar to those seen 
by traceroute. In the median (mean) case, we measure 
87% (83%) of the hops in the traceroute for the PlanetLab
dataset. For the traceroute server dataset, we measure
75% (74%) of the hops in the direct traceroute, but
28% (29%) of the hops discovered by reverse traceroute
do not appear in the corresponding traceroute.
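
The comparison metric used above can be expressed as a small function. This is a sketch: the alias table format (IP address to router identifier) is an assumption, and the addresses are placeholders.

```python
def fraction_of_hops_seen(direct, reverse, alias_of=None):
    """Fraction of hops on the direct traceroute also seen by reverse traceroute.

    alias_of maps an IP address to a router identifier; addresses without an
    entry are treated as their own router.
    """
    alias_of = alias_of or {}

    def canonical(ip):
        return alias_of.get(ip, ip)

    direct_routers = [canonical(ip) for ip in direct]
    if not direct_routers:
        return 0.0
    reverse_routers = {canonical(ip) for ip in reverse}
    seen = sum(1 for router in direct_routers if router in reverse_routers)
    return seen / len(direct_routers)


direct = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.3.1"]
reverse = ["10.0.0.2", "10.0.1.1", "10.0.9.9", "10.0.3.1"]
aliases = {"10.0.0.1": "r0", "10.0.0.2": "r0"}   # two interfaces, one router

print(fraction_of_hops_seen(direct, reverse))            # 0.5
print(fraction_of_hops_seen(direct, reverse, aliases))   # 0.75
```

The gap between the two results shows how missing alias data can make two measurements of the same path look different. Passing a table that maps addresses to PoPs instead of routers would give a coarser, PoP-level comparison.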

The figures also compare reverse traceroute to other 
potential ways of estimating the reverse path. All tech- 
niques used the same alias resolution. Researchers of- 
ten (sometimes implicitly) assume symmetry, and opera- 
tors likewise rely on forward path measurements when 
they need reverse ones. The Guess Fwd lines depict 
how many of the hops seen in a traceroute from R to 
S are also seen in a traceroute from S to R. In the median
(mean) case, the forward path shares 38% (39%)
of the reverse path’s hops for the PlanetLab dataset and 
40% (43%) for the traceroute server dataset. Another 
approach would be to measure traceroutes from a set of 
vantage points to the source. Using iPlane’s PlanetLab
and traceroute server measurements, the Intersect TR line
in Figure 8 shows how well this approach works, by assuming
the reverse path is the same as the forward path
until it intersects one of the traceroutes⁴. No system today
performs this type of path intersection on-demand
for users. In the median (mean) case, this traceroute intersection
shares 69% (67%) of the actual traceroute’s
hops. This result suggests that simply having a few hundred
or thousand traceroute vantage points is not enough
to reliably infer reverse paths; our system uses our novel
measurement techniques to build off these traceroutes
and achieve much better results.


Figure 8: For reverse traceroute and techniques for approximating
the reverse path, the fraction of hops on a direct traceroute
from the destination to the source that the technique also discovers.
Uses our PlanetLab dataset (reverse paths from 200
PlanetLab destinations back to 11 PlanetLab sources). [Key
labels are in the same top-to-bottom order as the lines.]


What are the causes of differences between a reverse 
traceroute and a directly measured traceroute? Although
it is common to think of the path given by traceroute
as the true path, in reality it is also subject to measurement
error. In this section, we discuss reasons traceroute
and reverse traceroute may differ from each other
and/or from the true path taken. 

Assumptions of symmetry: When reverse traceroute is 
unable to identify the next reverse hop, we resort to 
assuming that hop is symmetric. These assumptions 
may lead to inaccuracies. For the PlanetLab dataset, if 
we consider only cases when we measure a complete 
path without assuming symmetry, in the median (mean) 
case reverse traceroute matches 93% (90%) of the tracer- 
oute alias-level hops. Similarly, for the traceroute server 
dataset, in the median (mean) case reverse traceroute 
finds 83% (81%) of the traceroute hops. We discuss how 
often we have to assume symmetry in Section 5.2. 
Incomplete alias information: Many of the differences 
between the paths found by reverse traceroute and tracer- 
oute are due to missing alias information. Most alias-pair 
identification relies on sending probes to the two IP ad- 
dresses and comparing the IP-IDs of the responses. For 
the PlanetLab dataset, of all the missing addresses seen
in a traceroute that are not aliases of any hop in the corresponding
reverse traceroute, 88% do not allow for such
alias resolution [4]. Similarly, of all extra addresses seen
in some reverse traceroute that are not aliases of any hop
in the corresponding traceroute, 82% do not allow
for alias resolution. For the traceroute server dataset,
75% of the missing addresses and 74% of the extra ones
do not allow it. Even for addresses that do respond to
alias techniques, our alias sets are likely incomplete.


⁴We omit the line from Figure 9 to avoid clutter.

Figure 9: For reverse traceroute and techniques for approximating
the reverse path, the fraction of hops on a direct traceroute
from the destination to the source that the technique also
discovers. Our traceroute server dataset includes reverse paths
from servers in 186 networks back to 10 PlanetLab sources.
[Key labels are in the same top-to-bottom order as the lines.]


In these cases, it is possible or even likely that the two 
measurement techniques observe IP addresses that ap- 
pear different but are in fact aliases of the same router. 
To partially examine how this lack of alias information 
limits our comparison, we use iPlane’s Point-of-Presence
(PoP) clustering, which maps IP addresses to PoPs defined
by (AS, city) pairs [22]. For many applications such
as diagnosis of inflated latencies [21], PoP-level granularity
suffices. iPlane has PoP mappings for 71% of
the missing addresses in the PlanetLab dataset and 77% 
of the extra ones. For the traceroute server dataset, for 
which we have less alias information, it has mappings 
for 79% of the missing addresses and 86% of the extra 
ones. Figures 8 and 9 include PoP-Level lines showing 
the fraction of traceroute hops seen by reverse traceroute, 
if we consider PoP rather than router-alias-level compar- 
ison. In the median case, the reverse traceroute includes 
all the traceroute PoPs in both graphs (mean=94%, 84%). 
If reverse traceroute were measuring a different path than 
traceroute, then one would expect PoP-level comparisons 
to differ about as much as alias-level ones. The implica- 
tion of the measured PoP-level similarity is that, when 
traceroute and reverse traceroute differ, they usually dif- 
fer only in which router or interface in a PoP the path 
traverses. As a point of comparison, Figure 9 includes a 
PoP-level version of the Guess Fwd line; in the median 
(mean) case, it includes only 60% (61%) of the PoPs; the 
paths are quite asymmetric even at the PoP granularity. 
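
The difference between the two comparison granularities can be illustrated with a minimal sketch. The function name `fraction_seen`, the addresses (taken from documentation ranges), and the toy PoP table are assumptions made for illustration; they are not taken from our system or from iPlane.

```python
def fraction_seen(traceroute_hops, revtr_hops, cluster_of=None):
    """Fraction of traceroute hops that also appear in the reverse traceroute.

    cluster_of optionally maps an IP address to a coarser identifier, such as
    an (AS, city) PoP pair; unmapped addresses are compared as raw IPs.
    """
    cluster_of = cluster_of or {}
    seen = {cluster_of.get(ip, ip) for ip in revtr_hops}
    matched = sum(1 for ip in traceroute_hops if cluster_of.get(ip, ip) in seen)
    return matched / len(traceroute_hops)


# Hypothetical paths: the two middle hops differ at the IP level.
traceroute = ["192.0.2.1", "198.51.100.7", "203.0.113.9", "192.0.2.254"]
reverse_traceroute = ["192.0.2.1", "198.51.100.8", "203.0.113.77", "192.0.2.254"]

# Hypothetical PoP table: the two 198.51.100.x interfaces share a PoP,
# while the two 203.0.113.x interfaces sit in different cities.
pops = {
    "198.51.100.7": ("AS64500", "Seattle"),
    "198.51.100.8": ("AS64500", "Seattle"),
    "203.0.113.9": ("AS64501", "Chicago"),
    "203.0.113.77": ("AS64501", "Denver"),
}

ip_level = fraction_seen(traceroute, reverse_traceroute)         # 2 of 4 hops
pop_level = fraction_seen(traceroute, reverse_traceroute, pops)  # 3 of 4 hops
```

In this toy case the PoP-level fraction is higher than the IP-level one because one pair of differing interfaces belongs to the same PoP, which is the effect the PoP-Level lines are meant to expose.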

Load-balancing and contemporaneous path changes:
Another measurement artifact is that traceroute and re- 
verse traceroute may uncover different, but equally valid, 
paths, either due to following different load-balanced op- 
tions or due to route changes during measurement. To 
partly capture these effects, the Next Day lines in Figures
8 and 9 compare how many of the traceroutes’ hops
are also on traceroutes issued the following day. For 
the PlanetLab dataset, 26% of the paths exhibit some 
router-level variation from day to day. For the tracer- 
oute dataset, 49% of paths changed at the router level 
and (not shown in the graph) 15% changed at the PoP- 
level. In a loose sense, these results suggest an upper 
bound — even the same measurement issued at a different 
time may yield a different path. 

Hidden or anonymous routers: Previous work compar- 
ing traceroute to record route paths found that 16% of IP 
addresses appear with RR but do not appear in tracer- 
outes [36, 35]. Hidden routers, such as those inside 
some MPLS tunnels, do not decrement TTL. Anonymous 
routers decrement TTL but do not send ICMP replies, appearing
as ‘*’ in traceroute.

Exceptional handling of options packets: Packets with 
IP options are generally diverted to a router’s route pro- 
cessor and may be processed differently than on the line 
card. For example, previous work suggests that pack- 
ets with options are load-balanced per-packet, rather than 
per-flow [35]. 

An additional source of discrepancies between the two 
techniques is that traceroute and reverse traceroute make 
different assumptions about routing. Our techniques as- 
sume destination-based routing — if the path from D to 
S passes through R, from that point on it is the same as 
R’s path to S. An options packet reports only links it ac- 
tually traversed. With traceroute, on the other hand, a 
different packet uncovers each hop, and it assumes that 
if R1 is at hop k and R2 is at hop k+1, then there is a link
R1-R2. However, it does not make the same assumption
about destination routing, as each probe uses (S,D) as the 
source and destination. These differing assumptions lead 
to two more causes of discrepancies between a traceroute 
and a reverse traceroute: 

Traceroute inferring false links: Although we use the 
Paris traceroute technique for accurate traversal of flow- 
based load balancers, it can still infer false links in the 
case of packet-based balancing [2]. These spurious links 
appear as discrepancies between traceroute and reverse 
traceroute, but in reality show a limitation of traceroute. 
Exceptions to destination-based routing: With many tun- 
nels, an option-enabled probe will see the entire tunnel as 
a single hop. With certain tunnels, however, our assump- 
tion of destination-based routing may not hold. When 
probed directly, an intermediate router inside the tunnel 
may use a path to the destination other than the one that 






































[Figure 10 plot omitted. x-axis: Fraction of path measured without assuming symmetry. Key: Reverse Traceroute, No Timestamping, No Spoofed RR, Intersecting Traceroutes.]


Figure 10: For the PlanetLab dataset, the fraction of reverse 
path hops measured, rather than assumed symmetric. The 
graph includes results with subsets of the reverse traceroute 
techniques. 


continues through the tunnel. To partly capture the de- 
gree of this effect, we perform a study that eliminates 
it. From each of the 200 PlanetLab nodes used as desti- 
nations in this section, we issue both a traceroute and an 
RR ping to each of the 11 used as sources, so the RR ping 
will have the same source and destination as the tracer- 
oute (unlike with reverse traceroute’s RR probes to inter- 
mediate routers). Since the RR slots may fill up before 
the probe reaches the destination, we only check if the 
traceroute matches the portion of the path that appears in 
the RR. After alias resolution, the median fraction of RR 
hops seen in the corresponding traceroute is 0.67, with 
the other factors described in this section accounting for 
the differences. This fraction is 0.2 lower than that for re- 
verse traceroute, showing the difficulty in matching RR 
hops to traceroute hops. 


5.2 Coverage 


In Section 5.1, we noted that our paths are more accu- 
rate when our techniques succeed in measuring the en- 
tire path without having to fall back to assuming a link is 
symmetric. As seen in Figure 8, if forced to assume the 
entire path is symmetric, in the median case we would 
discover only 39% of the hops on a traceroute. In this 
section, we investigate how often our techniques are able 
to infer reverse hops, keeping us from reverting to as- 
sumptions of symmetry. Using the PlanetLab dataset, 
Figure 10 presents the results for our complete technique, 
as well as for various combinations of the components of 
our technique. The metric captured in the graph is the 
fraction of hops in the reverse traceroute that were mea- 
sured, rather than assumed symmetric. 

Reverse traceroute finds most hops without assuming 
symmetry. In the median path in the PlanetLab dataset, 
we measure 95% of hops (mean=87%), and in 80% of 
cases we are able to measure at least 78% of the path 
without assuming symmetry. By contrast, the
traceroute intersection estimation technique from Fig- 
ure 8 assumes in the median that the last 25% of the 
path is symmetric (mean=32%). Although not shown in
the graph, the results are similar for the traceroute server 
dataset — in the median case, reverse traceroute measures 
95% of hops (mean=92%) without assuming symmetry. 
The graph also depicts the performance of our tech- 
nique if we do not use spoofed record route pings or do 
not use timestamping. In both cases, the performance 
drops off somewhat without both probing methods. 


5.3 Overhead


We assess the overhead of our technique using the tracer- 
oute server dataset from Section 5.1, comparing the time 
and number of probes required by our system to those 
required by traceroute. The median (mean) time for one 
of the 10 PlanetLab sites to issue a traceroute to one 
of the 186 traceroute servers was 5 seconds (9.4 seconds).
Using our current system, as available on
http://revtr.cs.washington.edu and described in
Section 4, the median (mean) time to measure a reverse 
traceroute was 41 seconds (116.0 seconds), including the 
time to send an initial forward traceroute (to determine if 
the destination is reachable and to present a round-trip 
path at the end). We have not yet pursued improving this 
aspect of the system. In the future, we will investigate 
lowering this delay by setting more aggressive timeouts 
for flaky PlanetLab vantage points, by cutting down on 
the communication overhead, and by attempting to adapt 
our probing rate to the rate limit of the particular target. 

For each reverse traceroute measurement, our sys- 
tem sends the initial forward traceroute and a number 
of options-enabled ping packets, some of which may be 
spoofed. In cases when it is unable to determine the next 
reverse hop, it sends a forward traceroute and assumes 
the last hop is symmetric. In addition, we require tracer- 
outes to build an atlas of paths to the source, and we use 
ongoing background mapping to identify adjacencies and 
to determine which vantage points are within RR-range 
of which prefixes. 

If we ignore the probing overhead of the traceroute 
atlas and the mapping, in the median case, the only 
traceroute required is the initial one (mean=1.2 tracer- 
outes). In the median (mean) case, a reverse traceroute 
requires 2 (2.6) record route packets, plus an additional 
9 (21.2) spoofed RR packets. The median (mean) num- 
ber of non-spoofed timestamp packets is 0 (5.1), and the 
median (mean) number of spoofed timestamp packets is 
also 0 (6.5). The median (mean) total number of options
packets sent is 13 (35.4). As a point of comparison,
traceroute uses around 45 probe packets on average,
3 for each of around 15 hops. At the end of a reverse 
traceroute, we also send 3 pings to each hop to measure 
latency. So, ignoring the creation of the various atlases, 
reverse traceroute generally requires roughly 2-3x more 
packets than traceroute. 
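
The rough 2-3x figure can be reproduced from the counts above with a back-of-the-envelope calculation. The sketch below is only an approximation: it assumes the reverse path also has about 15 hops (so the final latency pings cost 3 x 15 packets), and it combines medians of separate quantities, which is not the same as the median of the total. The names are hypothetical.

```python
TRACEROUTE_PROBES = 45  # about 3 probes for each of about 15 hops
LATENCY_PINGS = 3 * 15  # 3 pings per hop; assumes a reverse path of about 15 hops


def packet_ratio(options_packets, traceroutes):
    """Packets for one reverse traceroute relative to one forward traceroute."""
    total = options_packets + traceroutes * TRACEROUTE_PROBES + LATENCY_PINGS
    return total / TRACEROUTE_PROBES


# Median case: 13 options packets and only the initial traceroute.
median_ratio = packet_ratio(options_packets=13, traceroutes=1)    # 103 / 45
# Mean case: 35.4 options packets and 1.2 traceroutes.
mean_ratio = packet_ratio(options_packets=35.4, traceroutes=1.2)  # about 134.4 / 45
```

Both ratios fall between 2 and 3, consistent with the estimate in the text.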

IP address        AS name       Location
132.170.3.1       UCF           Orlando, FL
198.32.155.89     FloridaNet    Orlando, FL
198.32.132.64     FloridaNet    Jacksonville, FL
198.32.132.19     Cox Comm.     Atlanta, GA
68.1.0.221        Cox Comm.     Ashburn, VA
216.52.127.8      Internap      Washington, DC
66.79.151.129     Internap      Washington, DC
66.79.146.202     Internap      Washington, DC
66.79.146.241     Internap      Miami, FL
66.79.146.129     Internap      Seattle, WA

Table 1: Traceroute giving forward path from University of
Central Florida to 66.79.146.129.


The atlases represent the majority of our probe over- 
head. However, in many circumstances these atlases can 
be reused and/or optimized for performance. For exam- 
ple, if the source requests reverse paths for multiple des- 
tinations within a short period of time [45], we can reuse 
the atlas. As an optimization, we may need to issue only
those traceroutes that are likely to intersect [22], and
we can use known techniques to reduce the number of 
probes to generate the atlas [11]. We borrow the adja- 
cency information needed for our timestamp probes from 
an existing mapping service [22]. To determine which 
spoofing vantage points are likely within record route 
range of a destination, we regularly issue probes from 
every spoofer to a set of addresses in each prefix. In the 
future, we plan to investigate if we can reduce this over- 
head by probing only a single address within each prefix. 


6 Applications of Reverse Traceroute 


We believe many opportunities exist for improving sys- 
tems and studies using reverse traceroute. We next dis- 
cuss three such examples of how reverse traceroute can 
be used in practice. We intend these sections to illustrate 
a few ways in which one can apply our tool; they are not 
complete studies of the problems. 


6.1 Case study of debugging path inflation 


Large content providers attempt to optimize client per- 
formance by replicating their content across a geographi- 
cally distributed set of servers. A client is then redirected 
to the server to which it has minimum latency. Though 
this improves the performance perceived by clients, it 
can still leave room for improvement. Internet routes are 
often inflated [37], which can lead to round-trip times 
from a client to its nearest server being much higher than 
what they should be given the server’s proximity. Us- 
ing Google as an example, 20% of client prefixes experi- 
ence more than 50ms latency over the minimum latency 
to the prefix’s geographical region. Google wants a way 
to identify which AS is the cause of inflation, but it is 
hindered by the lack of information about reverse paths
back to its servers from clients [21].

IP address        AS name       Location
66.79.146.129     Internap      Seattle, WA
66.79.146.225     Internap      Seattle, WA
137.164.130.66    TransitRail   Los Angeles, CA
137.164.129.15    TransitRail   Los Angeles, CA
137.164.129.34    TransitRail   Palo Alto, CA
137.164.129.2     TransitRail   Seattle, WA
137.164.129.11    TransitRail   Chicago, IL
137.164.131.165   TransitRail   Ashburn, VA
132.170.3.1       UCF           Orlando, FL
132.170.3.33      UCF           Orlando, FL

Table 2: Reverse traceroute giving reverse path from 
66.79.146.129 back to University of Central Florida. The 
circuitous reverse path explains the huge RTT jump be- 
tween the last two hops on the forward path. The third 
hop, 137.164.130.66 (internap-peer.lsanca01.transitrail.net), is 
a peering point between Internap and TransitRail in L.A. 


As an illustration, we used reverse traceroute to di- 
agnose an example of path inflation. We measured the 
RTT on the path from the PlanetLab node at the Univer- 
sity of Central Florida to the IP address 66.79.146.129, 
which is in Seattle, to be 149ms. Table 1 shows the 
forward path returned by traceroute, annotated with the 
locations of intermediate hops inferred from their DNS 
names. The path has some circuitousness going from Or- 
lando to Washington via Ashburn and then returning to 
Miami. But, that does not explain the steep rise in RTT 
from 53ms to 149ms on the last segment of the path, be- 
cause a hop from Miami to Seattle is expected to only 
add 70ms to the RTT⁵.

To investigate the presence of reverse path inflation 
back from the destination, we determined the reverse 
path using reverse traceroute. Table 2 illustrates the re- 
verse path, which is noticeably circuitous. Starting from 
Seattle, the path goes through Los Angeles and Palo 
Alto, and then returns to Seattle before reaching the des- 
tination via Chicago and Ashburn. We verified with a 
traceroute from a PlanetLab machine at the University 
of Washington that TransitRail and Internap connect in 
Seattle, suggesting that the inflation is due to a routing 
misconfiguration. Private communication with an op- 
erator at one of the networks confirmed that the detour 
through Los Angeles was unintentional. Without the in- 
sight into the reverse path provided by reverse traceroute, 
such investigations would not be possible by the organi- 
zations most affected by inflated routes. 


6.2 Topology discovery 


Studies of Internet topology rely on the set of available 
vantage points and data collection points. With a limited 
number available, routing policies bias what researchers 
measure. As an example, with traceroute alone, topology 


⁵Interestingly, the latency to Ashburn seems to also be inflated on
the reverse path.








Figure 11: Example of our techniques aiding in topology dis- 
covery. With traceroutes alone, V1 and V2 can measure only 
the forward (solid) paths. If V2 is within 8 hops of D1, a record 
route ping allows it to measure the link AS3-AS2, and a record 
route ping spoofed as V1 allows it to measure AS3-AS5. 


discovery is limited to measuring forward paths from a 
few hundred vantage points to each other and to other 
destinations. Reverse traceroute allows us to expose 
many peer-to-peer links invisible to traceroute. 

Figure 11 illustrates one way in which our techniques 
can uncover links. Assume that AS3 has a peer-to-peer 
business relationship with the other ASes. Because an 
AS does not want to provide free transit, most routes 
will traverse at most one peer-to-peer link. In this ex- 
ample, traffic will traverse one of AS3’s peer links only 
if it is sourced or destined from/to AS3. V1’s path to
AS3 goes through AS4, and V2’s path to AS3 goes through
AS1. Topology-gathering systems that rely on traceroute
alone [22, 1, 34] will observe the links AS1-AS3, AS4-AS3,
and AS2-AS5. But, they will never traverse AS3-AS5
or AS3-AS2, no matter what destinations they probe
(even ones not depicted). V2 can never traverse AS1-AS3-AS5
in a forward path (assuming standard export policies),
because that would traverse two peer-to-peer links.
However, if V2 is within 8 hops of D1, then it can issue a
record-route ping that will reveal AS3-AS2, and a spoofed
record route (spoofed as V1) to reveal AS3-AS5⁶.
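
The export-policy argument can be checked mechanically. The sketch below encodes only the simplification stated above, that a route traverses at most one peer-to-peer link; the function name is hypothetical, and the peering set is our reading of Figure 11 (AS3 peers with each of the other ASes).

```python
def peer_links_on_path(as_path, peer_links):
    """Count the peer-to-peer links an AS-level path would traverse."""
    hops = zip(as_path, as_path[1:])
    return sum(1 for a, b in hops if frozenset((a, b)) in peer_links)


# Figure 11 (as we read it): AS3 peers with each of the other ASes.
peer_links = {frozenset(("AS3", other)) for other in ("AS1", "AS2", "AS4", "AS5")}

# V2's forward path into AS3 crosses one peer link, which is allowed.
forward_to_as3 = peer_links_on_path(["AS1", "AS3"], peer_links)

# A forward path through AS3 to AS5 would cross two peer links, so
# traceroute from V2 never observes AS3-AS5.
through_as3 = peer_links_on_path(["AS1", "AS3", "AS5"], peer_links)
```

A path that yields a count above one is exactly the kind that forward probing cannot exercise, which is why the record route measurements are needed to reveal the AS3-AS5 and AS3-AS2 links.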

Furthermore, even services like Route Views [27] and 
RIS [33], with BGP feeds from many ASes, likely miss 
these links. Typical export policies mean that only 
routers in an AS or its customers see the AS’s peer-to- 
peer links. Since Route Views has vantage points in only 
a small percentage of the ASes lower in the AS hierarchy, 
it does not see most peer links [29, 16]. 

To demonstrate how reverse traceroute can aid in 
topology mapping, we apply it to a recent study on map- 
ping Internet exchange points (IXPs) [3]. That study 
used existing measurements, novel techniques, and thousands
of traceroute servers to provide IXP peering matrices
that were as complete as possible. As part of the


⁶Note that, because we only query for hops already known to be
adjacent, our timestamp pings are not useful for topology discovery. 
study, the researchers published the list of ASes they
found to be peering at IXPs, the IXPs at which they
peered, and the IP addresses they used in those peerings.

We measured the reverse paths back from those IP addresses
to all PlanetLab sites. We discovered 9096 IXP
peerings (triples of the two ASes and the IXP at which
they peer) that are not in the published dataset, adding
an additional 16% to the 58,534 peerings in their study.
As one example, we increased the number of peerings
found at the large London LINX exchange by 19%. If we
consider just the ASes observed peering and not which
IXP they were seen at, we found an additional 5057 AS
links not in the 51,832 known IXP AS links, an increase
of 10%. Of these AS links, 1910 do not appear in either
traceroute [22] or BGP [40] topologies; besides
not being known as IXP links, these are links
not seen in some of the most complete topologies available.
Further, of the links in both our data and UCLA’s
BGP topology, UCLA classifies 1596 as Customer-to-Provider
links, whereas the fact that we observed them at
IXPs strongly suggests they are Peer-to-Peer links.
Although the recent IXP study was by far the most exhaustive
yet, reverse traceroute provides a way to observe
even more of the topology.


6.3 Measuring one-way link latency 


In addition to measuring a path, traceroute measures a 
round-trip latency for each hop. Techniques for geoloca- 
tion [42, 18], latency estimation [22], and ISP compar- 
isons [25], among others, depend on link latency mea- 
surements obtained by subtracting the RTT to either end- 
point, then halving the difference (possibly with a filter 
for obviously wrong values). This technique should yield 
fairly accurate values if routes traverse the link symmet- 
rically. However, previous work found that 88-98% of 
paths are asymmetric [15], resulting in substantial errors
in link latency estimates [39]. More generally, the inabil- 
ity to isolate individual links is a problem when using 
network tomography to infer missing data — tomography 
works best only when the links are traversed symmetri- 
cally or when one knows both the forward and reverse 
paths traversed by the packets [10, 6]. 

A few alternatives exist for estimating link latencies 
but none are satisfactory. Rocketfuel infers link weights 
used in routing decisions [23], which may or may not re- 
flect latencies. The geographic locations of routers pro- 
vide an estimate of link latency, but such information 
may be missing, wrong, or outdated, and latency does not 
always correspond closely to geographic distance [18]. 

In this section, we revisit the problem of estimating 
link latencies since we now have a tool that provides re- 
verse path information to complement traceroute’s for- 
ward path information. Given path asymmetry, the re- 
verse paths from intermediate routers likely differ from 
Figure 12: Error in estimating latencies for Sprint inter-PoP
links. For each technique, we only include links for which it 
provided an estimate: 61 of 89 links using traceroute, and 74 
of 89 using reverse traceroute. Ground truth reported only to 
0.5ms granularity. 


the end-to-end traceroutes in both directions. Without re- 
verse path information from the intermediate hops back 
to the hosts, we cannot know which links a round-trip 
latency includes. Measurements to endpoints and inter- 
mediate hops yield a large set of paths, which we sim- 
plify using IP address clustering [22]. We then generate 
a set of linear constraints: for any intermediate hop R 
observed from a source S, the sum of the link latencies
on the path from S to R plus the sum of the link latencies
on the path back from R must equal the round-trip
latency measured between S and R. We then solve this
set of constraints using least-squares minimization, and
we also identify the bound and free variables in the solution.
Bound variables are those sufficiently constrained
for us to solve for the link latencies, and free variables
are those that remain underconstrained.
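
To make the distinction between bound and free variables concrete, the following toy sketch builds the constraints for a four-link example and reports which link latencies are determined. It is not our solver: it substitutes exact row reduction for least-squares minimization, which is only appropriate because the toy RTTs are noise-free, and all link names, latencies, and function names are hypothetical.

```python
from fractions import Fraction


def rref(rows):
    """Reduce an augmented matrix (last column = RTT) to reduced row echelon form."""
    m = [r[:] for r in rows]
    n_cols = len(m[0]) - 1
    pivots = []
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        pv = m[r][c]
        m[r] = [x / pv for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    return m, pivots


def solve_links(links, constraints):
    """constraints: (forward_links, reverse_links, rtt_ms) per measured hop."""
    idx = {link: i for i, link in enumerate(links)}
    rows = []
    for fwd, rev, rtt in constraints:
        row = [Fraction(0)] * (len(links) + 1)
        for link in list(fwd) + list(rev):
            row[idx[link]] += 1  # a link used in both directions counts twice
        row[-1] = Fraction(rtt)
        rows.append(row)
    m, pivots = rref(rows)
    bound = {}
    for r, c in enumerate(pivots):
        # A link is bound if its pivot row involves no other link.
        if all(m[r][j] == 0 for j in range(len(links)) if j != c):
            bound[links[c]] = m[r][-1]
    free = [link for link in links if link not in bound]
    return bound, free


links = ["S-A", "A-B", "A-C", "B-C"]
constraints = [
    (["S-A"], ["S-A"], 4),                        # symmetric path to A
    (["S-A", "A-B"], ["B-C", "A-C", "S-A"], 13),  # asymmetric path to B
    (["S-A", "A-C"], ["A-C", "S-A"], 14),         # symmetric path to C
]
bound, free = solve_links(links, constraints)
```

Here S-A and A-C are bound (2 ms and 5 ms), while A-B and B-C remain free because the measurements constrain only their sum. With real, noisy RTTs the constraints are generally inconsistent, which is why a least-squares fit is used instead of exact elimination.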


We evaluate our approach on the Sprint backbone net- 
work by comparing against inter-PoP latencies Sprint 
measures and publishes [38]. We consider only the di- 
rectly connected PoPs and halve the published round-trip 
times to yield link latencies we use as ground truth, for 
89 links between 42 PoPs. We observe 61 of the 89 links 
along forward traceroutes and 79 with reverse traceroute. 
We use these measurements to formulate constraints on 
the inter-PoP links, based on round-trip latencies mea- 
sured from PlanetLab nodes to the PoPs using ping. This 
set of constraints allows us to solve for the latencies of 
74 links, leaving 5 free and 10 unobserved. 


As a comparison point, we use a traditional method 
for estimating link latency from traceroutes [22]. For 
each forward traceroute that traverses a particular Sprint 
link, we sample the link latency as half the difference be- 
tween the round-trip delay to either end, then estimate 
the link latency to be the median of these samples across 
all traceroutes. Figure 12 shows the error in the latency 
estimates of the two techniques, compared to the pub- 
lished ground truth. Our approach infers link latencies 
with errors from 0ms to 2.2ms for the links, with a median
of 0.4ms and a mean of 0.6ms. Because Sprint reports
round-trip delays with millisecond granularity, the
values we use for ground truth have 0.5ms granularity,
so our median “error” is within the granularity of the
data. The estimation errors using the traditional traceroute
method range from 0ms to 22.2ms, with a median
of 4.1ms and a mean of 6.2ms, roughly 10x our worst-case,
median, and mean errors. Based on this initial study of a
single large network for which we have ground-truth, us- 
ing reverse traceroute to generate and solve constraints 
yields values very close to the actual latencies, whereas 
the traditional approach does not. 
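
For reference, the traditional estimator used as the comparison point is simple enough to state in a few lines. The sketch below is a minimal illustration with hypothetical RTT samples and a hypothetical function name; it shows how asymmetric reverse paths from the far end of a link can inflate the estimate.

```python
from statistics import median


def traditional_link_latency(rtt_pairs):
    """Median over traceroutes of half the RTT difference across a link.

    rtt_pairs holds (rtt_to_near_end_ms, rtt_to_far_end_ms) for each forward
    traceroute that traverses the link.
    """
    return median((far - near) / 2 for near, far in rtt_pairs)


# Hypothetical samples for a link whose true one-way latency is 4 ms.
symmetric = [(10.0, 18.0), (12.0, 20.0), (11.0, 19.0)]
# Same link, but most replies from the far end return over longer reverse paths.
asymmetric = [(10.0, 30.0), (12.0, 20.0), (11.0, 33.0)]

est_symmetric = traditional_link_latency(symmetric)
est_asymmetric = traditional_link_latency(asymmetric)
```

With symmetric samples the estimate matches the assumed 4 ms, whereas the asymmetric samples yield 10 ms, since the extra reverse-path delay is attributed to the link itself.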


7 Related work 


Measurement techniques: Previous work concluded 
that too many paths dropped packets with IP options 
for options to form the basis of a system [13]. The 
Passenger and DisCarte projects, however, showed that 
the record route option, when set on traceroute pack- 
ets, reduces false links, uncovers more routers, and pro- 
vides more complete alias information [36, 35]. Hubble 
demonstrated the use of spoofed packets to probe a path 
in one direction without having to probe the other [19], 
but it does not determine the routers along the reverse 
path. Addressing this limitation in Hubble was part of 
the original motivation for this work. 


The contributions of these various projects are in how
they employ existing IP techniques — options and spoof- 
ing — towards useful ends. Our work employs the same 
IP techniques in new ways. We demonstrate how spoof- 
ing with options can expose reverse paths. Whereas Pas- 
senger and DisCarte used RR to improve forward path 
information, we use RR in non-TTL-limited packets to 
measure reverse paths. As far as we are aware, our work 
is the first to productively employ the timestamp option. 


Techniques for inferring reverse path information: 
Various earlier techniques proposed methods for infer- 
ring limited reverse path information. Before such pack- 
ets were routinely filtered, one study employed loose 
source-routing [31] to measure paths from numerous re- 
mote sites. Other interesting work used return TTL val- 
ues to estimate reverse routing maps towards sources; 
however, the resulting maps contained less than half the 
actual links, as well as containing multiple paths from 
many locations [7]. PlanetSeer [43] and Hubble [19] in- 
cluded techniques for isolating failures to either the for- 
ward or reverse path; neither system, however, can give 
information about where on a reverse path the failure oc- 
curs. Netdiff inferred path asymmetry in cases where 
hop counts differ greatly in the two directions [25]; however,
as our example in Section 6.1 shows, very asymmetric
paths can have the same hop count. Tulip used
ICMP timestamps (not the IP timestamp option we use) 


and other techniques to identify reordering and loss along 
either the forward or reverse path [24]. 


Systems that would benefit from reverse path infor- 
mation: Many systems seem well-designed to make use 
of reverse path information, but, lacking it, make various 
substitutions or compromises. We mention some recent 
ones here. Geolocation systems use delay and path infor- 
mation to constrain the position of targets [14, 18, 42], 
but, lacking reverse path data, are underconstrained.
iPlane shows that knowledge of a few traceroutes from 
a prefix greatly improves path predictions [22], but lacks 
vantage points in most prefixes. iSpy attempted to detect
prefix hijacks using forward-path traceroutes, yet the signature
it looked for is based on the likely pattern of reverse
paths [46]. Similarly, intriguing recent work on
inferring topology through passive observation of traf- 
fic bases its technique on an implicit assumption that the 
hop counts of forward and reverse paths are likely to be 
the same [12]. Similarly, systems for network monitor- 
ing often assume path symmetry [8, 25]. All these efforts 
can potentially benefit from the work described here. 


8 Conclusion


Although widely-used and popular, traceroute is fun- 
damentally limited in that it cannot measure reverse 
paths. This limitation leaves network operators and re- 
searchers unable to answer important questions about In- 
ternet topology and performance. To solve this problem, 
we developed a reverse traceroute system to measure re- 
verse paths from arbitrary destinations back to the user. 
The system uses a variety of methods to incrementally 
build a path back from the destination hop-by-hop, un- 
til it reaches a known baseline path. We believe that our 
system makes a strong argument for both the IP times- 
tamp option and source spoofing as important measure- 
ment tools, and we hope that PlanetLab and ISPs will 
consider them valuable components of future measure- 
ment testbeds. 


Our reverse traceroute system is both effective — in 
the median case finding all of the PoPs seen by a di- 
rect traceroute along the same path — and useful. The 
tool allows operators to conduct investigations impossi- 
ble with existing tools, such as tracking down path in- 
flation along a reverse route. Many operators seem to 
view reverse traceroute as a useful tool — based on the 
results presented in this paper, we received requests to 
help us test the tool and offers of spoofing vantage points, 
including hosts at all the PoPs of an international back- 
bone network. The system’s probing methods have also 
proved useful for topology mapping. In illustrative ex- 
amples, we demonstrated how our system can discover 
more than a thousand peer-to-peer links invisible to both 
BGP route collectors and to traceroute-based mapping 
efforts, as well as how it can be used to accurately measure
the latency of backbone links. We believe the accuracy
and coverage of the tool will only improve as we add
additional vantage points. A demo of our tool is available
at http://revtr.cs.washington.edu.


Seamless BGP Migration With Router Grafting

Abstract


Network operators are under tremendous pressure to 
make their networks highly reliable to avoid service dis- 
ruptions. Yet, operators often need to change the network 
to upgrade faulty equipment, deploy new services, and 
install new routers. Unfortunately, changes cause dis- 
ruptions, forcing a trade-off between the benefit of the 
change and the disruption it will cause. In this paper we 
present router grafting, where parts of a router are seam- 
lessly removed from one router and merged into another. 
We focus on grafting a BGP session and the underlying 
link from one router to another, or between blades in
a cluster-based router. Router grafting allows an oper- 
ator to rehome a customer with no disruption, compared 
to downtimes today measured in minutes. In addition, 
grafting a BGP session can help in balancing load be- 
tween routers or blades, planned maintenance, and even 
traffic management. We show that grafting a BGP ses- 
sion is practical even with today’s monolithic router soft- 
ware. Our prototype implementation uses and extends 
Click, the Linux kernel, and Quagga, and introduces a 
daemon that automates the migration process. 


1 Introduction 


In nature, grafting is where a part of one living organ- 
ism (e.g., tissue from a plant) is removed and fused into 
another organism. In this paper, we apply this concept 
to routers to enable new network-management capabili- 
ties which allow network changes to be made with mini- 
mal disruption. We call this router grafting. With router 
grafting, we view routers in terms of their parts and en- 
able splitting these parts from one router and merging 
them into another. This capability makes the view of the 
network a more fluid one where the topology can readily 
change, allowing operators to adapt their networks with- 
out disruption in the service offered to users. We envision 
router grafting to eventually be applicable to arbitrary 
subsets of router resources and/or protocols. However,
in this paper we take the first step towards this vision by
focusing on how to “graft” a BGP session and the underlying
link from one router to another.


1.1 A Case for Router Grafting 


The ability to adapt the network is an essential com- 
ponent of network management. Unfortunately, to- 
day’s routers and routing protocols make change diffi- 
cult. Changes to the network cause disruption, forc- 
ing operators to weigh the benefit of making a change 
against the potential impact performing the change will 
have. For example, today, the basic task of rehoming a
BGP session requires shutting down the session, reconfiguring
the new router, restarting the session, and exchanging
a large amount of routing information, typically
leading to downtimes of several minutes. Further complicating
matters is the fact that service-level agreements
with customers often prohibit events that result in downtime
without receiving prior approval and scheduling a
maintenance window. This handcuffs the operator. In
this section we provide several motivating examples of 
why seamless migration is needed and why it would be 
desirable to do at the level of individual sessions. 

Load balancing across blades in a cluster router: 
Today’s high-end routers have modular designs consist- 
ing of many cards—processor blades for running rout- 
ing processes and interface cards for terminating links— 
spread over multiple chassis. In essence, the router itself 
is a large distributed system. Load balancing is an im- 
portant function in distributed systems, and routers are 
no exception—today’s routers often run near their lim- 
its of processing capacity [1]. Unfortunately, routers are 
not built with load balancing in mind. A BGP session 
is associated with a routing process on a particular blade 
upon establishment, making it difficult to shift load to an- 
other blade. A common approach used with Web servers 
is to drain load by directing new requests to other servers 
and waiting for existing requests to complete. Unfortunately,
this technique is not applicable to routers, since
routing sessions run indefinitely and, unlike web services,
have persistent state. However, with the ability to migrate
individual sessions, achieving better utilization of
the router’s processing capabilities is possible.


Rehoming a customer: An ISP homes a customer to 
a router based on geographic proximity and the availabil- 
ity of a router slot that can accommodate the customer’s 
request [2]. However, this is done only at the time when 
a customer initiates service, based on the state of the net- 
work at that time. Rehoming might be necessary if the 
customer upgrades to a new service (such as multicast, 
IPv6, or advanced QoS or monitoring features) available 
only on a subset of routers. Rehoming is also necessary 
when an ISP upgrades or replaces a router and needs to 
move sessions from the old router to the new one. Cus- 
tomer rehoming involves moving the edge link—which 
can be done quickly because of recent innovations in 
layer-two access networks—as well as the BGP session. 


Planned maintenance: Maintenance is a fact of life 
for network operators, yet, even though maintenance is 
planned in advance, little can be done to keep the router 
running. Consider a simple task of replacing a power 
supply. The best common practice is for operators to re- 
configure the routing protocols to direct traffic away from 
that router and, once the traffic stops flowing, to take the 
router offline. Unfortunately, this approach only works 
for core routers within an ISP where alternate paths are 
available. At the edge of the network, an attractive al- 
ternative would be to graft all of the BGP sessions with 
neighboring networks to other routers to avoid disrup- 
tions in service. Migrating at the level of individual ses- 
sions is preferable to migrating all of the sessions and the 
routing processes as a group, since fine-grain migration 
allows multiple different routers to absorb only a small 
amount of extra load during the maintenance interval. 


Traffic engineering: Traffic engineering is the act of 
reconfiguring the network to optimize the flow of traffic, 
to minimize congestion. Today, traffic engineering in- 
volves adjusting the routing-protocol parameters to coax 
the routers into computing new paths that better match 
the offered traffic, at the expense of transient disrup- 
tions during routing convergence. Router grafting en- 
ables a new approach to traffic engineering, where cer- 
tain customers are rehomed to an edge router that better 
matches the traffic patterns. For example, if most of a 
customer’s traffic leaves the ISP’s network at a particular 
location, that customer could be rehomed closer to that 
egress point. In other words, we no longer need to con- 
sider the traffic matrix as fixed when performing traffic 
engineering—instead, we can change the traffic matrix to 
better match the backbone topology and routing by hav- 
ing traffic enter the network at a new location. 


1.2 Challenges and Contributions


The benefits of router grafting are numerous. However,
the design of today’s routers and routing protocols
makes realizing router grafting challenging. Grafting
a BGP session involves (i) migrating the underlying
TCP connection, (ii) exchanging routing state, (iii) moving
the routing-protocol configuration from one router
to another, and (iv) migrating the underlying link. Ideally,
all these actions need to be performed in a manner
that is completely transparent (i.e., without involving the
routers and operators in neighboring networks) and does
not disrupt forwarding and routing (i.e., data packets are
not dropped and routing adjacencies remain up).

Unfortunately, we cannot simply apply existing tech- 
niques for application-level session migration. Moving 
a BGP session to a different router changes the net- 
work topology and hence, the routing decisions at other 
routers. In particular, the remote end-point of the session 
must be informed of any routing changes—that is, any 
differences between the “best routes” chosen by the new 
and old homing points. Similarly, other routers in the 
ISP network need to change how they route toward des- 
tinations reachable through that remote end-point—they 
need to learn that these destinations are now reachable 
through the new homing location. 

In addition, we cannot simply apply recently-proposed 
techniques for virtual-router migration [3], for two main 
reasons. First, the two physical routers may not be 
compatible—they may run different routing software 
(e.g., Cisco, Juniper, Quagga, or XORP). Second, we 
want to migrate and merge only a single BGP session, 
not the entire routing process, as many scenarios bene- 
fit from finer granularity. Instead, we view virtual-router 
migration as a complementary management primitive. 

Fortunately, extending existing router software to sup- 
port grafting requires only modest changes. The essential 
state that must be migrated is often well separated in the 
code. This makes it possible to export the state from one 
router and import it to another without much complex- 
ity. In this paper, we present an architecture for realizing 
router grafting and make the following contributions: 


• Introduce the concept of router grafting, and realize
an instance of it through BGP session migration.
We demonstrate that BGP session migration
can be performed in today’s monolithic routing
software, without much modification or refactoring
of the code. Our fully-automated prototype
router-grafting system is built by using and extending
Click, Linux, and Quagga.


• Achieve transparency, where the remote BGP session
end-point is not modified and is unaware migration
is happening. We achieve this by bootstrapping
a routing session at the new homing location,
with the old router emulating the remote end-point.
The new homing point then takes over the role of the
old router, sending the necessary routing updates to
notify the remote end-point of routing changes.


• Introduce optimizations to nearly eliminate the impact
of migration on other routers not directly involved
in the migration. We achieve this by capitalizing
on the fact that the routers already have much
of the routing information they need, and that we
know the identity of the old and new homing points.


• Describe an architecture where unplanned routing
changes (such as link failures) during the grafting
process do not affect correctness, and where packets
are delivered successfully even during the migration.
At worst, packets temporarily traverse a
different path than the control plane advertises, a
common situation during routing convergence.


Figure 1: Migration protocol layers.


The remainder of the paper is organized as follows. 
Section 2 discusses how the operation of BGP makes 
router grafting challenging. In Section 3 we present the 
router grafting architecture, focusing only on the control 
plane. Section 4 explains how we ensure correct routing 
and forwarding, even in the face of unplanned routing 
changes. In Section 5 we present our prototype, followed 
by a discussion of optimizations that reduce the overhead 
of grafting a BGP session in Section 6. We present an 
evaluation of our prototype and proposed optimizations 
in Section 7, followed by related work in Section 8 and 
the conclusion in Section 9. 


2 BGP Routing Within a Single AS 


Grafting a BGP session is difficult because BGP routing
relies on many layers in the protocol stack and many
components within an AS. In this section, we present a
brief overview of BGP routing from the perspective of
a single autonomous system (AS) to identify the challenges
our grafting solution must address.


2.1 Protocol Layers: IP, TCP, & BGP 


As illustrated in Figure 1, two neighboring routers ex- 
change BGP update messages over a BGP session that 
runs on top of a TCP connection that, in turn, directs 
packets over the underlying IP link(s) between them. As 
such, grafting a BGP session will require moving the IP 
link, TCP connection, and BGP session from one loca- 
tion to another. 

IP link: An AS connects to neighboring ASes through 
IP links. While a link could be a direct cable between 
two routers, these IP-layer links typically correspond to
multiple hops in an underlying layer-two network. For 
example, routers at an exchange point often connect via 
a shared switch, and an ISP typically connects to its cus- 
tomers over an access network. These layer-two net- 
works are increasingly programmable, allowing dynamic 
set-up and tear-down of layer-three links [4, 5, 6, 7]. This 
is illustrated in Figure 1 where the link between routers 
A and B is through a programmable transport network 
which can be changed to connect routers A and C. These 
innovations enable seamless migration of an IP link from 
one location to another within the scope of the layer-two 
network, such as rehoming a customer’s access link to 
terminate on a different router in the ISP’s network.¹

TCP connection: The neighboring routers exchange 
BGP messages over an underlying TCP connection. Un- 
like a conventional TCP connection between a Web 
client and a Web server, the connection must stay “up” 
for long periods of time, as the two routers are continu- 
ously exchanging messages. Further, each router sends 
keep-alive messages to enable the other router to detect 
lapses in connectivity. Upon missing three keep-alive 
messages, a router declares the other router as dead and 
discards all BGP routes learned from that neighbor. As 
such, grafting a BGP session requires timely migration 
of the underlying TCP connection. 

BGP session: Two adjacent routers form a BGP ses- 
sion by first establishing a TCP session, then sending 
messages negotiating the properties of the BGP session, 
then exchanging the “best route” for each destination 
prefix. This process is controlled by a state machine that 
specifies what messages to exchange and how to han- 
dle them. Once the BGP session is established, the two 


¹Depending on the technology used to realize the layer-two
network, the scope might be geographically contained, e.g., in the case
of a packet access network, or might be significantly more spread out,
e.g., in the case of a national-footprint programmable optical transport
network.
routers send incremental update messages, announcing
new routes and withdrawing routes that are no longer 
available. A router stores the BGP routes learned from 
its neighbor in an Adj-RIB-in table, and the routes an- 
nounced to the neighbor in an Adj-RIB-out table. Each 
BGP session has configuration state that controls how 
a router filters and modifies BGP routes that it imports 
from (or exports to) the remote neighbor. As such, graft- 
ing a BGP session requires transferring a large amount of 
RIB (Routing Information Base) state, as well as moving 
the associated configuration state. 


2.2 Components: Blades, Routers, & ASes 


A BGP session is associated with a routing process that 
runs on a processor blade within one of the routers in 
a larger AS. As such, grafting a BGP session involves 
extracting the necessary state from the routing process, 
transferring that state to another location, and changing 
the routing decisions at other routers as needed. 

Processor blade: The simplest router has a proces- 
sor for running the routing process, multiple interfaces 
for terminating links, and a switching fabric for directing 
packets from one interface to another. The BGP rout- 
ing process maintains sessions with multiple neighbors 
and runs a decision process over the Adj-RIB-in tables 
to select a single “best” route for each destination prefix. 
The routing process stores the best route in a Loc-RIB ta- 
ble, and applies export policies to construct the Adj-RIB- 
out tables and send the corresponding update messages 
to each neighbor. 

IP router: Today’s high-end routers are large dis- 
tributed systems, consisting of hundreds of interfaces and 
multiple processor blades spread over one or more chas- 
sis. These routers run multiple BGP processes—one on 
each processor blade—each responsible for a portion of 
the BGP sessions as shown in Figure 2. For a cluster- 
based router to scale, each BGP process runs its own de- 
cision process and exchanges its “best” route with the 
other BGP processes in the router, using a modified ver- 
sion of internal BGP (iBGP) [8]. This allows the dis- 
tributed router to behave the same way as a simple router 
that runs a single BGP process. Any BGP process can 
handle any BGP session, since all processors can reach 
the interface cards through the switching fabric. As such, 
grafting a BGP session from one blade to another in the 
same router (e.g., the session with X from RP1 to RP2 
in Figure 2) does not require migrating the underlying 
layer-three link. 

Autonomous System (AS): An AS consists of mul- 
tiple, geographically-distributed routers. Each router 
forms BGP sessions with neighboring routers in other 
ASes, and uses iBGP to disseminate its “best” route to 
other routers within the AS. The routers in the same 
AS also run an Interior Gateway Protocol (IGP), such
as OSPF or IS-IS to compute paths to reach each other. 
Each router in the AS runs its own BGP process(es) and 
selects its own best route for each prefix. The routers 
may come to different decisions about the best route, 
not only because they learn different candidate routes 
but also because the decision depends on the IGP dis- 
tances to other routers (in a practice known as hot-potato 
routing). This can be seen in Figure 3 where routers B 
and C have different paths to the destination d. As such, 
grafting a BGP session from one router to another (e.g., 
the session with A from router B to C in Figure 3) may 
change the BGP routing decisions. 
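
The effect of hot-potato routing can be illustrated with a minimal Python sketch. It is not the full BGP decision process: it assumes that all candidate routes tie on every earlier attribute and differ only in the IGP distance to the egress router. The function name, router names, and distances are hypothetical and chosen only for illustration.

```python
def best_egress(candidates, igp_distance):
    """Pick the egress router with the smallest IGP distance.

    candidates: egress routers offering an otherwise-equal route to the prefix.
    igp_distance: dict mapping egress router -> IGP distance from this router.
    Ties are broken by router name so the result is deterministic.
    """
    return min(candidates, key=lambda egress: (igp_distance[egress], egress))


# Hypothetical IGP distances as seen from routers B and C.
candidates = ["B", "C"]
dist_from_b = {"B": 0, "C": 10}
dist_from_c = {"B": 10, "C": 0}

choice_at_b = best_egress(candidates, dist_from_b)  # B prefers its own exit
choice_at_c = best_egress(candidates, dist_from_c)  # C prefers its own exit
```

Because each router evaluates the same candidates against its own IGP distances, moving a session from one router to another changes which candidate looks closest, and hence may change the selected route.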


3 Router Grafting Architecture 


Seamless grafting of a BGP session relies on a care- 
ful progression through a number of coordinated steps. 
These steps are summarized in Figure 4, which shows 
a migrate-from router that hands off one of its BGP ses- 
sions to a migrate-to router in the same AS. These routers 
do not need to run the same software or be from the 
same vendor—they need only have the added support 
for router grafting. When the grafting process starts, the
migrate-from router is responsible for handling a BGP 
session with the remote end-point router A (not shown). 
This BGP session with router A is to be migrated. The 
migrate-from router begins exporting the routing infor- 
mation and the migrate-to router is initialized with its 
own session-level data structures and a copy of the policy 
configuration, without actually establishing the session 
(Figure 4(a)). Then, the TCP connection is migrated, fol- 
lowed by the underlying link (Figure 4(b)). Finally, the 
migrate-to router imports the routing state and updates 
the other routers (Figure 4(c)), resulting in the migrate-to
router handling the BGP session with the remote end-point
(Figure 4(d)). This section focuses exclusively on
control-plane operations, deferring discussion of the data 
plane until Section 4. 


3.1 Copying BGP Session Configuration 


Each BGP session end-point has a variety of configu- 
ration state needed to establish the session with the re- 
mote end-point (with a given IP address and AS num- 
ber) and apply policies for filtering and modifying route 
announcements. The network operators, or an automated
management system, configure the session end-point
by applying configuration commands at the router’s
command-line interface or uploading a new configura- 
tion file. The router stores the configuration information 
in various internal data structures. 

Rather than exporting these internal data structures, 
we capitalize on the fact that the current configuration 
is captured in a well-defined format in the configura- 
tion file. Our design simply “dumps” the configura- 
tion file for the migrate-from router, extracts the com- 
mands relevant to the BGP session end-point, and applies 
these commands to the migrate-to router, after appropri- 
ate translation to account for vendor-dependent differ- 
ences in the command syntax. This allows the migrate- 
to router to create its own internal data structures for the 
configuration information. 

However, the migrate-to router is not yet ready to as- 
sume responsibility for the BGP session. To finish ini- 
tializing the migrate-to router, we extend the BGP state 
machine to include an ‘inactive’ state, where the router 
can create data structures and import state for the ses- 
sion without attempting to communicate with the remote 
end-point. The migrate-to router transitions from the ‘in- 
active’ state to ‘established’ state when instructed by the 
grafting process. 
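
The role of the added ‘inactive’ state can be sketched as follows. This is a simplified illustration, not the Quagga modification itself: it models only three states rather than the full BGP state machine, and the class and method names (GraftedSession, prepare, activate, can_send) are hypothetical.

```python
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"
    INACTIVE = "inactive"        # added for grafting; nothing is sent to the peer
    ESTABLISHED = "established"


class GraftedSession:
    """Session end-point on the migrate-to router (illustrative only)."""

    def __init__(self, peer_ip, peer_as):
        self.peer_ip = peer_ip
        self.peer_as = peer_as
        self.state = SessionState.IDLE

    def prepare(self):
        """Create session state without contacting the remote end-point."""
        if self.state is not SessionState.IDLE:
            raise RuntimeError("prepare() is only valid from the idle state")
        self.state = SessionState.INACTIVE

    def activate(self):
        """Called by the grafting process once the TCP connection and link have moved."""
        if self.state is not SessionState.INACTIVE:
            raise RuntimeError("activate() is only valid from the inactive state")
        self.state = SessionState.ESTABLISHED

    def can_send(self):
        """Only an established session may send BGP messages to the peer."""
        return self.state is SessionState.ESTABLISHED


# Documentation address and a private-use AS number, chosen for illustration.
session = GraftedSession("192.0.2.1", 65001)
session.prepare()
sends_while_inactive = session.can_send()
session.activate()
sends_when_established = session.can_send()
```

The point of the extra state is that the session skips the usual negotiation with the remote end-point: it moves directly from ‘inactive’ to ‘established’ on instruction from the grafting process.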


3.2 Exporting & Resetting Run-Time State 


A router maintains a variety of state for BGP session
end-points. To meet our goals, BGP grafting need
only consider the Routing Information Bases (RIBs);
the other state may simply be reinitialized at the migrate-to
router.²

Routing Information Bases (RIBs): The most important
state associated with the BGP session end-point
is stored in the routing information bases: the Adj-RIB-in
and Adj-RIB-out. In our architecture, we dump the
RIBs at the migrate-from router to prepare for import- 
ing the information at the migrate-to router. While the 
RIBs are represented differently on different router plat- 
forms, the information they store is standardized as part 
of the BGP protocol. In most router implementations, the 
RIB data structure is factored apart from the rest of the 
routing software, and many routers support commands 
for “dumping” the current RIBs. Even though the RIB 
dump formats vary by vendor, de facto standards like the 
popular MRT format [9] do exist. 

State in the BGP state machine: A BGP session end- 
point stores information about the BGP state machine. 
We can forgo migrating this state — the BGP session is 
either ‘established’ or not. If the session is in one of 
the not-established states, we can simply close the ses- 
sion at the migrate-from router and start the migrate-to 
router in the idle state. This does not trigger any tran- 
sient disruption—since the session is not “up” anyway. 
If the session at the migrate-from router is ‘established,’ 
we can start the new session at the migrate-to router in 
the ‘inactive’ state. 

BGP timers: BGP implementations also include a va- 
riety of timers, many of which are vendor-dependent. For 
example, some routers use an MRAI (Minimum Route 
Advertisement Interval) timer to pace the transmission of 
BGP update messages. This is purely a local operation 
at one end-point of the session, not requiring any agree- 
ment with the remote end-point. Another common timer 
is the keep-alive interval that drives the periodic send- 
ing of heartbeat messages, and a hold timer for detect- 
ing missing keep-alive messages from the remote end- 
point. Fortunately, missing a single keep-alive message, 
or sending the message slightly early or late, would not 
erroneously detect a session failure because routers typ- 
ically wait for three missed keep-alive messages before 
tearing down the session. As such, we do not migrate 
BGP timer values and instead simply initialize whatever 
timers are used at the migrate-to router. 
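
The tolerance described above amounts to simple arithmetic, sketched below. The 60-second keep-alive interval is an illustrative assumption, not a figure from this paper, and the function name is hypothetical; the factor of three follows the text.

```python
def session_survives(silence_seconds, keepalive_interval=60.0, missed_allowed=3):
    """Return True if the peer's hold timer would not expire.

    silence_seconds: how long the peer hears nothing on the session.
    keepalive_interval: assumed spacing of keep-alive messages (illustrative).
    missed_allowed: keep-alives that may be missed before the session is torn down.
    """
    hold_time = keepalive_interval * missed_allowed
    return silence_seconds < hold_time


# A few seconds of silence during migration is far below the hold time.
short_gap_ok = session_survives(5.0)
# Silence longer than three keep-alive intervals would drop the session.
long_gap_ok = session_survives(200.0)
```

Under these assumed values the hold time is 180 seconds, so restarting the timers at the migrate-to router costs at most a slightly early or late keep-alive.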

BGP statistics: BGP implementations maintain nu- 
merous statistics about each session and even individual 
routes. These statistics, while broadly useful for network 
monitoring, are not essential to the correct operation of 
the router. They only have meaning at the local session 


²Router grafting does not preclude migrating the remaining state;
we simply chose not to, in order to keep code modifications to
a minimum while still meeting our goals of (i) routing-protocol adjacencies
staying up and (ii) all routing-protocol messages being received.
end-point. In addition, these statistics are vendor dependent
and not well modularized in the router software
implementations. As such, we do not migrate these statistics
and instead allow the migrate-to router to initialize
its own statistics as if it were establishing a new session.


3.3 Migrating TCP Connection & IP Link


As part of BGP session grafting, the TCP connection 
must move from the migrate-from router to the migrate- 
to router. Because we do not assume any support from 
the remote end-point, the migrate-to router must use the 
same IP addresses and sequence and acknowledgment 
numbers that the migrate-from router was using. In BGP, 
IP addresses are used to uniquely identify the BGP ses- 
sion end-points and not the router as a whole. Further, 
we assume the link between the remote end-point and 
the migrate-from (or migrate-to) router is a single hop IP 
network where the IP address is not used for reachability, 
but only for identification. As such, the session end-point 
can easily retain its address (and sequence and acknowl- 
edgment numbers) when it moves. That is, the single IP 
address identifying the migrating session can be disasso- 
ciated from the migrate-from router and associated with 
the migrate-to router. Our architecture simply migrates 
the local state associated with the TCP connection from 
one router to another. 

As with any TCP migration technique, the network 
must endure a brief period of time when neither router 
is responsible for the TCP connection. TCP has its own 
retransmission mechanism that ensures that the remote 
end-point retransmits any unacknowledged data. As long 
as the transient outage is short, the TCP connection (and, 
hence, the BGP session) remains up. TCP implementa- 
tions tolerate a period of at least 100 seconds [10] with- 
out receiving an acknowledgment—significantly longer 
than the migration times we anticipate. The amount of 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation


TCP state is relatively small, and the two routers are 
close to one another, leading to extremely fast TCP mi- 
gration times. 

The underlying link should be migrated (e.g., by 
changing the path in the underlying programmable trans- 
port network) close to the same time as the TCP connec- 
tion state, to minimize the transient disruption in con- 
nectivity. Still, the network may need to tolerate a brief 
period of inconsistency where (say) the TCP connec- 
tion state has moved to the migrate-to router while the 
traffic still flows via the migrate-from router. During 
this period, we need to prevent the migrate-from router 
from erroneously responding to TCP packets with a TCP 
RST packet that resets the connection. This is easily 
prevented by configuring the migrate-from router’s interface
to drop TCP packets sent to the BGP port (i.e.,
179). The migrate-from router can successfully deliver
regular data traffic received during the transition, as
discussed later in Section 4.


3.4 Importing BGP Routing State 


Once link and connection migration are complete, the 
migrate-to router can move its end-point of the BGP ses- 
sion from the ‘inactive’ state to the ‘established’ state. At 
this time, the migrate-to router can begin “importing” the 
RIBs received from the migrate-from router. However, 
the import process is not as simple as merely loading 
the RIB entries into its own internal data structures. The 
migrate-from and migrate-to routers could easily have a 
different view of the “best” route for each destination 
prefix, as illustrated in Figure 5. In this scenario, be- 
fore the migration, A reaches E’s prefixes over the di- 
rect link between them, and B reaches E’s prefixes via 
A; after the migration, A should reach E’s prefixes via 
B, and B should reach E’s prefixes over the direct link. 
Similarly, suppose routers C and D connect to a common
prefix. Before the migration, E follows the AS path “100
200 300” (through C) to reach that prefix; after the migration
E follows the AS path “100 200 400” (through
D). Reaching these conclusions requires routers A and
B to rerun the BGP decision process based on the new
routes, and disseminate any routing changes to neighboring
routers.


Figure 5: A topology where AS 200 has migrate-from
router A, migrate-to router B, internal router F, and external
routers C, D, and G, and remote end-point E.


To make the process transparent to the remote end- 
point, we essentially emulate starting up a new session 
at router B, with router A temporarily playing the role 
of the remote end-point to announce the routes learned 
from E. This requires router A to replay the Adj-RIB- 
in state associated with E to router B. Router B stores 
these routes and reruns its BGP decision process, as nec- 
essary, to compute the new best routes to prefixes E is 
announcing. This will cause update messages to be sent 
to other routers within the AS and, sometimes, to exter- 
nal routers (like C and D). If the attributes of the route 
(e.g., the AS-PATH) do not change, as is the case in Fig- 
ure 5, other ASes like AS 300 and AS 400 do not re- 
ceive any BGP update message (since, from their point 
of view, the route has not changed), thus minimizing the 
overhead that router grafting imposes on the global BGP 
routing system. 


Next, we update E with the best routes selected by 
B. Here, we take advantage of the fact that E has al- 
ready learned routes from the migrate-from router A. 
The change in topology might change some of those 
routes, and we need to account for that. To do so, the 
migrate-to router runs the BGP decision process to com- 
pare its currently-selected best route to the route learned 
from the migrate-from router. If the best route changes, 
B sends an update message to its neighbors, including 
router E. This is in fact exactly the same operation the 
router would perform upon receiving a route update from 
any of its neighbors. We expect that routers A and B 


would typically have the same best route for most pre- 
fixes, especially if A and B are relatively close to each 
other in the IGP topology. As such, most of the time 
router B would not change its best route and hence would 
not need to send an update message to router E. 
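
The step of updating E only where the best route has changed can be sketched as below. This is a simplification under stated assumptions: a route is reduced to its AS path, the decision process at B is taken as already run, and the function name, prefixes, and AS numbers are hypothetical.

```python
def updates_for_remote(announced_by_old, best_at_new):
    """Compute the messages the migrate-to router must send to the remote end-point.

    announced_by_old: prefix -> AS path the migrate-from router had announced.
    best_at_new: prefix -> AS path selected by the migrate-to router.
    Returns prefix -> new AS path, or None where the route must be withdrawn.
    """
    updates = {}
    for prefix, path in best_at_new.items():
        if announced_by_old.get(prefix) != path:
            updates[prefix] = path      # new or changed route: announce it
    for prefix in announced_by_old:
        if prefix not in best_at_new:
            updates[prefix] = None      # no route at the new home: withdraw
    return updates


# Routes as announced to the remote end-point by the old router.
announced_by_old = {
    "10.0.0.0/8": (200, 300),
    "172.16.0.0/12": (200, 500),
}
# Best routes at the new router: only the first prefix differs.
best_at_new = {
    "10.0.0.0/8": (200, 400),
    "172.16.0.0/12": (200, 500),
}

updates = updates_for_remote(announced_by_old, best_at_new)
```

In this example only one of the two prefixes produces an update, which mirrors the expectation that most best routes are unchanged when the two routers are close in the IGP topology.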


4 Correct Routing and Forwarding 


Router grafting cannot be allowed to compromise the 
correct functioning of the network. In this section, we 
discuss how grafting preserves correct routing state (in 
the control plane) and correct packet forwarding (in the 
data plane), even when unexpected routing changes oc- 
cur in the middle of the grafting process. 


4.1 Control Plane: BGP Routing State 


Routing changes can, and do, happen at any time. BGP 
routers easily receive millions of update messages a day, 
and these could arrive at any time during the grafting pro- 
cess — while the migrate-from router dumps its routing 
state, while the TCP connection and underlying link are 
migrated, or while the migrate-to router imports the rout- 
ing state and updates its routing decisions. Our grafting 
solution can correctly handle BGP messages sent at any 
of these times. 

While the migrate-from router dumps the BGP
routing state: The goal is to have the in-memory Routing
Information Base (RIB) be consistent with the RIB
that was dumped as part of migration. Here, we take
advantage of the fact that the dumping process and the
BGP protocol work on a per-prefix basis. Consider an
Adj-RIB-in with three routes (p1, p2, p3) corresponding
to three prefixes, of which p1 and p2 have been dumped
already. When an update p3’ (for the same prefix as p3)
is received, the in-memory RIB can be updated since it
corresponds to a prefix that has not been dumped. To
prevent dumping a prefix while it is being updated, the
single entry in the RIB needs to be locked. If we receive
an update p1’ (for the same prefix as p1), processing
it and updating the in-memory RIB without updating
the dumped image will cause the two to be inconsistent.
Delaying processing of the update is an option, but that
would delay convergence as well. To solve this, we capitalize
on BGP being an incremental protocol where any
new update message implicitly withdraws the old one.
Since we treat the dumped RIB as a sequence of update
messages, we can process the update immediately and
append p1’ to the end of the dumped RIB to keep it consistent.
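
The p1, p2, p3 example can be traced with a short sketch that treats the dump as an ordered sequence of updates. Routes are represented as plain strings, per-entry locking is omitted, and the replay helper is hypothetical; the sketch only illustrates why appending keeps the dump consistent.

```python
def replay(dump):
    """Rebuild a RIB from a dump treated as a sequence of (prefix, route) updates.

    A later update for a prefix implicitly replaces the earlier one;
    a route of None stands for a withdrawal.
    """
    rib = {}
    for prefix, route in dump:
        if route is None:
            rib.pop(prefix, None)
        else:
            rib[prefix] = route
    return rib


# In-memory Adj-RIB-in with three prefixes; p1 and p2 are already dumped.
rib = {"p1": "r1", "p2": "r2", "p3": "r3"}
dump = [("p1", "r1"), ("p2", "r2")]

# Update for p3, which has not been dumped yet: only the in-memory RIB changes.
rib["p3"] = "r3'"

# Update for p1, which is already dumped: update memory and append to the dump.
rib["p1"] = "r1'"
dump.append(("p1", "r1'"))

# The dumping process then reaches p3 and writes its current value.
dump.append(("p3", rib["p3"]))

rebuilt = replay(dump)
```

Replaying the dump in order yields the same table as the in-memory RIB, even though the dump contains a stale entry for p1.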

While the TCP connection and link are migrating: 
BGP update messages may be sent while the TCP con- 
nection and the underlying link are migrating. If a mes- 
sage is sent by the remote end-point, the message is not 
delivered and is correctly retransmitted after the link and
TCP connection come up at the migrate-to router. If an 
update message is sent by another router to the migrate- 
from router over a different BGP session, there is not a 
problem because the migrate-from router is no longer re- 
sponsible for the recently-rehomed BGP session. There- 
fore, the migrate-from router can safely continue to re- 
ceive, select, and send routes. If an update message is 
sent by another router to the migrate-to router over a dif- 
ferent BGP session, the migrate-to router can install the 
route in its Adj-RIB-in for that session and, if needed, 
update its selection of the best route — similar to when a 
route is received before the migration process. 

While the migrate-to router imports the routing 
state: The final case to consider is when the migrate-to 
router receives a BGP update message while importing 
the routing state for the rehomed session. Whether from 
the remote end-point or another router, if the route is for 
a prefix that was already imported, there is no problem 
since the migration of that prefix is complete. If it is for
a prefix that has not already been imported, only mes- 
sages from the remote end-point router need special care. 
(BGP is an asynchronous protocol that does not depend 
on the relative order of processing for messages learned 
from different neighbors.) A message from the remote 
end-point must be processed after the imported route but 
we would like to process it immediately. Since the update 
implicitly withdraws the previous announcement (which 
is in the dump image), we mark the RIB entry to indicate 
that it is more recent than the dump image. This way, we 
can skip importing any entries in the dump image which 
have a more recent RIB update. 
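
The marking scheme can be sketched as follows. The set of marked prefixes stands in for the per-entry mark described above, routes are plain strings, and the function names are hypothetical; this is an illustration of the idea rather than the prototype's data structures.

```python
def on_remote_update(adj_rib_in, fresh, prefix, route):
    """Process a live update from the remote end-point immediately.

    The prefix is marked as more recent than the dump image.
    A route of None stands for a withdrawal.
    """
    if route is None:
        adj_rib_in.pop(prefix, None)
    else:
        adj_rib_in[prefix] = route
    fresh.add(prefix)


def import_dump(adj_rib_in, fresh, dump):
    """Import dumped routes, skipping prefixes that have a more recent update."""
    for prefix, route in dump:
        if prefix in fresh:
            continue                  # the live update supersedes the dump entry
        adj_rib_in[prefix] = route
    return adj_rib_in


adj_rib_in = {}
fresh = set()

# A live update for p2 arrives before p2 has been imported from the dump.
on_remote_update(adj_rib_in, fresh, "p2", "r2-new")

# The import then proceeds over the dump image from the migrate-from router.
import_dump(adj_rib_in, fresh, [("p1", "r1"), ("p2", "r2-old")])
```

The stale dump entry for p2 is skipped, so the table ends up with the route from the live update, as if the update had been processed after the imported route.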


4.2 Data Plane: Packet Forwarding 


Thus far, this paper has focused on the operation of the 
BGP control plane. However, the control plane’s only 
real purpose is to select paths for forwarding data pack- 
ets. Fortunately, grafting has relatively little data-plane 
impact. When moving a BGP session between blades in 
the same router, the underlying link does not move and 
the “best” routes do not change. As such, the forwarding 
table does not change, and data packets travel as they did 
before grafting took place — the data traffic continues to 
flow uninterrupted. 

The situation is more challenging when grafting a 
BGP session from one router to another, where these 
two routers do not have the same BGP routing infor- 
mation and do not necessarily make the same decisions. 
Because the TCP connection and link are migrated be- 
fore the migrate-to router imports the routing state, the 
remote end-point briefly forwards packets through the 
migrate-to router based on BGP routes learned from the 
migrate-from router. Since BGP route dissemination 
within the AS (typically implemented using iBGP) en- 
sures that each router learns at least one route for each 
destination prefix, the two routers will learn routes for 
the same set of destinations. Therefore, the undesirable 
situation where the remote end-point forwards packets 
that the migrate-to router cannot handle will not occur. 

Figure 6: The router grafting prototype system. 

Although data packets are forwarded correctly, the 
end-to-end forwarding path may temporarily differ from 
the control-plane messages. For example, in Figure 5, 
data packets sent by E will start traversing the path 
through AS 400, while E’s control plane still thinks the 
AS path goes through AS 300. These kinds of temporary 
inconsistencies are a normal occurrence during the BGP 
route-convergence process, and do not disrupt the flow 
of traffic. Once the migrate-to router finishes importing 
the routes, the remote end-point will learn the new best 
route and control- and data-plane paths will agree again. 
Correct handling of data traffic must also consider 
the packets routed toward the remote end-point. Dur- 
ing the grafting process, routers throughout the AS for- 
ward these packets to the migrate-from router until they 
learn about the routing change (i.e., the new egress point 
for reaching these destinations). Since the migrate- 
from router knows where the link, TCP connection, and 
BGP session have moved, it can direct packets in flight 
there through temporary tunnels established between the 
migrate-from router and the migrate-to router. 


5 BGP Grafting Prototype 


We have developed an initial prototype to demonstrate 
router grafting. Figure 6 depicts the main components of 
the prototype. These include (i) a modified Quagga [11] 
routing software, (ii) the graft daemon for controlling the 
entire process, (iii) the SockMi [12] kernel module for 
TCP migration, and (iv) a Click [13] based data plane 
for implementing link migration. 

The controlling entity in the prototype is the graft dae- 
mon. This is the entity that initiates the BGP session 
grafting, interacting with each of the other components 
to perform the necessary steps. We assume each graft 
daemon can be reached by an IP address. With this, 
the graft daemon on the migrate-from router will initi- 
ate a TCP connection with the daemon on the migrate-to 
router. Once established, the migration process follows 
the six general steps discussed in the following subsec- 
tions. 
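
As a rough orientation, the order of the six steps and the router on which each one runs can be written down as a small Python sketch. The step names, the `run_graft` helper, and the stop-on-first-failure behavior are assumptions made for illustration; they are not the graft daemon's actual interface.

```python
# Ordered steps, following Sections 5.1 through 5.6, with the router that acts.
GRAFT_STEPS = [
    ("configure_migrate_to", "migrate-to"),
    ("export_bgp_state", "migrate-from"),
    ("export_tcp_state", "migrate-from"),
    ("import_tcp_state", "migrate-to"),
    ("migrate_link", "migrate-to"),
    ("import_routing_state", "migrate-to"),
]


def run_graft(handlers):
    """Run each step in order; stop at the first handler that reports failure."""
    completed = []
    for name, router in GRAFT_STEPS:
        if not handlers[name](router):
            break
        completed.append(name)
    return completed


# Stub handlers that only record the call and report success.
trace = []
handlers = {
    name: (lambda router, n=name: trace.append((n, router)) or True)
    for name, _ in GRAFT_STEPS
}
done = run_graft(handlers)
```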


5.1 Configuring the Migrate-To Router 


In our architecture, configuration state is gleaned from 
a dump of the migrate-from router’s configuration file, 
rather than its internal data structures. The graft daemon 
first extracts BGP session configuration from the config- 
uration file of the migrate-from router, including the rules 
for filtering and modifying route announcements. Then 
the extracted configuration commands are applied to the 
migrate-to router. Our current implementation includes 
a simplistic parser for Quagga’s commands for configur- 
ing BGP sessions.3 In order to configure the migrate-to 
router before migrating the TCP connection, we added an 
‘inactive’ state to the BGP state machine. We also added 
a configuration command to the Quagga command-line 
interface: 


neighbor w.x.y.z inactive 


that triggers the router to create all internal data struc- 
tures for the session, without attempting to open or ac- 
cept a socket with the remote end-point. 
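
The effect of the added state can be sketched in Python as follows. This is a simplified model, not Quagga code: only a subset of the BGP states is shown, the class and method names are hypothetical, and the sketch assumes that a grafted connection resumes directly in the Established state.

```python
from enum import Enum


class SessionState(Enum):
    IDLE = "Idle"
    ESTABLISHED = "Established"
    INACTIVE = "Inactive"          # added state: configured, but no socket


class BgpSession:
    def __init__(self, neighbor):
        self.neighbor = neighbor
        self.state = SessionState.IDLE
        self.adj_rib_in = {}       # internal structures exist in every state
        self.socket = None

    def set_inactive(self):
        """Models the effect of `neighbor w.x.y.z inactive`."""
        self.state = SessionState.INACTIVE

    def may_open_or_accept(self):
        """An inactive session never opens or accepts a connection itself."""
        return self.state is not SessionState.INACTIVE

    def attach_imported_socket(self, sock):
        """Adopt a migrated TCP connection (assumed to resume as Established)."""
        if self.state is not SessionState.INACTIVE:
            raise RuntimeError("session was not prepared for grafting")
        self.socket = sock
        self.state = SessionState.ESTABLISHED


session = BgpSession("192.0.2.1")
session.set_inactive()
blocked = not session.may_open_or_accept()
session.attach_imported_socket(object())   # placeholder for a real socket
```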


5.2 Exporting Migrate-From BGP State 


Once the migrate-to router is configured, the grafting 
process can proceed to the second step, which is ini- 
tiating the export of the routing state on the migrate- 
from router. The grafting daemon on the migrate-from 
router initiates the export process by calling a command 
in Quagga that we added: 


neighbor w.x.y.z migrate out 


When this command is executed, our modified Quagga 
software traverses the internal data structures, dumping 
the necessary routing state (Adj-RIB-in and the selected 
routes in the loc-RIB) to a file. 
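
A minimal Python sketch of such a dump is shown below. The on-disk format used by the modified Quagga is not described here, so the JSON-lines layout, the field names, and the function name are all assumptions chosen for readability.

```python
import io
import json


def dump_routing_state(adj_rib_in, loc_rib, out):
    """Write one JSON record per prefix learned over the migrating session."""
    count = 0
    for prefix in sorted(adj_rib_in):
        record = {
            "prefix": prefix,
            "learned": adj_rib_in[prefix],    # route from the remote end-point
            "selected": loc_rib.get(prefix),  # route this router selected
        }
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


adj_rib_in = {"10.1.0.0/16": "300 500", "10.2.0.0/16": "300 600"}
loc_rib = {"10.1.0.0/16": "300 500", "10.2.0.0/16": "400 600"}
buffer = io.StringIO()                        # stands in for the dump file
written = dump_routing_state(adj_rib_in, loc_rib, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```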


3As we add support for XORP, we will develop a more complete 
parser as the configuration will require translating between configura- 
tion languages—generally a hard problem, though easier in our case 
because we focus on a relatively narrow aspect of the configuration. 


5.3 Exporting Migrate-From TCP State 


Once the routing state is dumped, the modified Quagga 
calls the export_socket function as part of the 
SockMi API to migrate the TCP state. This function 
makes an ioctl call to the kernel module, passing the 
socket’s file descriptor. The SockMi kernel module is 
a Linux kernel module for kernels 2.4 through 2.6—we 
tested with kernel version 2.6.19.7. The ioctl call 
causes the kernel module to interact with Linux’s inter- 
nal data structures. It removes the TCP connection from 
the kernel, writing the socket state to a character device. 
Note that part of this state is related to the protocol itself 
(e.g., the current sequence number) as well as the buffers 
(e.g., the receive queue and the transmit queue of packets 
sent, but not acknowledged). When this state is written, 
the kernel module sends a signal to the graft daemon on 
the migrate-from router, which can read from the char- 
acter device and send to the daemon on the migrate-to 
router. 


5.4 Importing the TCP State 


The next step is to initiate the import of the TCP state at 
the migrate-to router. Upon receiving the state from the 
migrate-from router, the graft daemon on the migrate-to 
router first notifies Quagga that it is about to import state 
for a given ‘inactive’ session. This is done through a 
command we added: 


neighbor w.x.y.z migrate in 


Upon executing the command, our modified Quagga in- 
vokes the import_socket function in the SockMi 
API. This function blocks until a TCP connection is im- 
ported. During this time, the graft daemon makes an 
ioctl call to the SockMi kernel module. The graft daemon 
then passes the TCP session state to a character device 
which is read by the kernel module. The SockMi ker- 
nel module accesses the Linux data structures to add a 
socket with that TCP connection state, which unblocks 
the import_socket function. 


5.5 Migrating the Layer-Three Link 


At this point, the graft daemon of the migrate-to router 
triggers the migration of the underlying link. This in- 
cludes removing the migrating session’s IP address from 
the migrate-from router, adding the IP address to the 
migrate-to router, and migrating the layer-two link. As 
we did not have access to equipment to use a pro- 
grammable transport network, we instead built our own 
simple layer-two network that connects both the migrate- 
from and migrate-to router to the remote end-point with a 
Click [13] configuration that emulates a ‘programmable 
transport’. This Click configuration performs a simple 
switching primitive that connects the remote end-point to 
either the migrate-from or the migrate-to router. In one 
setting, packets from the migrate-from router are sent to 
the remote end-point router, packets from the migrate- 
to router are dropped, and packets from the remote end- 
point router are sent to the migrate-from router. With 
the alternative setting, the reverse occurs, forming a link 
between the migrate-to router and the remote end-point 
router. This switch value is settable via a handler, making 
it accessible to the graft daemon running on the migrate- 
from router. 
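
The switching primitive can be modeled in a few lines of Python. This is a behavioral sketch only, not the Click configuration itself; the class name, port names, and `set_active` method are hypothetical stand-ins for the Click element and its handler.

```python
class GraftSwitch:
    """Connects the remote end-point to exactly one of the two routers."""

    ROUTERS = ("migrate-from", "migrate-to")

    def __init__(self):
        self.active = "migrate-from"

    def set_active(self, router):
        """Plays the role of the handler that sets the switch value."""
        if router not in self.ROUTERS:
            raise ValueError(router)
        self.active = router

    def forward(self, source):
        """Return the output port for a packet from `source`, or None to drop."""
        if source == "remote":
            return self.active
        if source == self.active:
            return "remote"
        return None                # packets from the inactive router are dropped


switch = GraftSwitch()
sources = ("migrate-from", "migrate-to", "remote")
before = [switch.forward(s) for s in sources]
switch.set_active("migrate-to")
after = [switch.forward(s) for s in sources]
```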


5.6 Importing Routing State 


As the final step, when the importing of the TCP connec- 
tion is complete and the import_socket function is 
unblocked, the modified Quagga reads the routing state, 
which was stored in a file when the local graft daemon 
read it in from the graft daemon running on the migrate- 
from router. Much as in the “normal” operation of the 
router, which receives a BGP message from a socket and 
then calls a function to handle the update, the importing 
process will read the Adj-RIB-in from a file and call the 
same function to process the routing update. For compar- 
ing the RIB from the migrate-from router to the migrate- 
to router, the importing process reads the route from the 
file, looks up the route in the local RIB, and compares 
them. If they differ, it will use existing functions to send 
out the route to the peer. 
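
The comparison step can be sketched in Python as below. The function name and the dictionary representation of the RIBs are assumptions; the sketch also assumes, following Section 4.2, that the migrate-to router has some route for every prefix, and simply skips a prefix if it does not.

```python
def routes_to_send(exported_selected, local_best):
    """Return the routes the peer must be told about after grafting.

    exported_selected: prefix -> route the migrate-from router had selected
    local_best:        prefix -> route the migrate-to router selects
    """
    updates = {}
    for prefix, old_route in exported_selected.items():
        new_route = local_best.get(prefix)
        if new_route is not None and new_route != old_route:
            updates[prefix] = new_route   # peer still holds the old route
    return updates


exported_selected = {"10.1.0.0/16": "300 500", "10.2.0.0/16": "300 600"}
local_best = {"10.1.0.0/16": "300 500", "10.2.0.0/16": "400 600"}
updates = routes_to_send(exported_selected, local_best)
```

Only the prefix whose selected route differs between the two routers produces an update; identical routes generate no message.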


6 Optimizations for Reducing Impact 


Grafting a BGP session requires incrementally updating 
the remote end-point as well as the other routers in the 
AS. In this section, we present optimizations that can 
further reduce the traffic and processing load imposed 
on routers not directly involved in the grafting process. 
These optimizations capitalize on the knowledge that 
grafting is taking place and the routers’ local copy of 
the routes previously learned from the remote end-point. 
First, we discuss how we can keep routers from send- 
ing unnecessary updates to their eBGP neighbors. Sec- 
ond, we then discuss how the majority of iBGP messages 
can be eliminated. Finally, we consider the intra-cluster 
router case where the routes do not change. 


6.1 Reducing Impact on eBGP Sessions 


Importing routes on the migrate-to router, and with- 
drawing routes on the migrate-from router, may trigger 
a flurry of update messages to other BGP neighbors. 
Consider the example in Figure 5, where before graft- 
ing router E had announced 192.168.0.0/16 to router A, 
which in turn announced the route to B and C. Eventu- 
ally two things will happen: (i) the migrate-from router 
A will remove the 192.168.0.0/16 route from E and (ii) 
the migrate-to router B will add the 192.168.0.0/16 route 
from E. Without any special coordination, these two 
events could happen in either order. 


If A removes the route before B imports it, then A’s 
eBGP neighbors (like router C) may receive a withdrawal 
message, or briefly learn a different best route (should A 
have other candidate routes), only to have A reannounce 
the route upon (re)learning it from B. Alternatively, if B 
adds the route before A sends the withdrawal message to 
C, then A may have both a withdrawal message and the 
subsequent (re)announcement queued to send to router 
C, perhaps leading to redundant BGP messages. In the 
first case, C may temporarily have no route at all, and in 
the second case C may receive redundant messages. In 
both cases these effects are temporary, but we would like 
to avoid them if possible. 


To do so, rather than deleting the route, A can mark the 
route as “exported”, safe in the knowledge that, if this 
route should remain the best route, A will soon (re)learn 
it from the migrate-to router B. For example, suppose the 
route from E is the only route for the destination prefix— 
then A would certainly (re)learn the route from B, and 
could forgo withdrawing and reannouncing the route to 
its other neighbors. Of course, if A does not receive the 
announcement (either after some period of time or im- 
plicitly through receiving an update with a different route 
for that prefix), then it can proceed with deleting the ex- 
ported route. 
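
A minimal Python sketch of this bookkeeping on router A follows. The class, its method names, and the fixed `hold_time` are hypothetical; routes are opaque strings, and timestamps are passed in explicitly so the example is deterministic.

```python
class MigrateFromRoutes:
    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.routes = {}           # prefix -> route currently announced
        self.exported = {}         # prefix -> deadline for relearning it

    def mark_exported(self, prefix, now):
        """Keep the route instead of deleting it when the session migrates."""
        self.exported[prefix] = now + self.hold_time

    def on_update(self, prefix, route):
        """Handle an update for the prefix; return True only if neighbors
        must be sent a new announcement (the route actually changed)."""
        self.exported.pop(prefix, None)
        changed = self.routes.get(prefix) != route
        self.routes[prefix] = route
        return changed

    def expire(self, now):
        """Delete exported routes never relearned; return prefixes to withdraw."""
        stale = [p for p, deadline in self.exported.items() if now >= deadline]
        for prefix in stale:
            del self.exported[prefix]
            self.routes.pop(prefix, None)
        return stale


a = MigrateFromRoutes(hold_time=30)
a.routes = {"192.168.0.0/16": "via E", "10.0.0.0/8": "via E"}
a.mark_exported("192.168.0.0/16", now=0)
a.mark_exported("10.0.0.0/8", now=0)
must_announce = a.on_update("192.168.0.0/16", "via E")   # relearned from B
withdrawn = a.expire(now=30)
```

The relearned prefix causes no external message, while the prefix that is never relearned is withdrawn once the hold time passes.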


So far we only considered the eBGP messages the 
migrate-from router would send. A similar situation can 
occur on the eBGP sessions of the other routers in the 
AS (e.g., router F). This is because these other routers 
must be notified (via iBGP) to no longer go through A 
for the routes learned over the migrating session (i.e., 
with E). Therefore, the migrate-from router must send 
out withdrawal messages to its iBGP neighbors and the 
migrate-to router must send out announcements to its 
iBGP neighbors. This may result in the other routers in 
the AS (e.g., router F) temporarily withdrawing a route, 
temporarily sending a different best route, or sending a 
redundant update to their eBGP neighbors. Because of 
this, we have the migrate-from router send the marked 
list to each of its iBGP neighbors and a notification that 
these all migrated to the migrate-to router — this list is 
simply the list of prefixes, not the associated attributes. 
We expect this list to be relatively small in terms of total 
bytes. With this list, the other routers in the AS can per- 
form the same procedure, and eliminate any unnecessary 
external messages. 



6.2 Reducing Impact on iBGP Sessions 


While using iBGP unmodified is sufficient for dealing 
with the change in topology brought about by migration, 
it is still desirable to reduce the impact migration has on 
the iBGP sessions. Here, since the route-selection pol- 
icy will likely be consistent throughout an ISP’s network, 
we can reduce the number of update messages sent by 
extending iBGP (an easier task than modifying eBGP). 
When the migrate-from and migrate-to routers select the 
same routes, the act of migration will not change the 
decision. Since all routers are informed of the migra- 
tion, the iBGP updates can be suppressed (the migrate- 
from router withdrawing the route and the migrate-to 
router announcing the route). When the migrate-from 
and migrate-to routers select different routes, it is most 
likely due to differences in IGP distances. For the 
migrate-to router, the act of migration will cause all 
routes learned from the remote end-point router to be- 
come directly learned routes, as opposed to some dis- 
tance away, and therefore the migrate-to router will now 
prefer those routes (except when the migrate-to router’s 
currently selected route is also directly learned). This 
change in route selection causes the migrate-to router to 
send updates to its iBGP neighbors notifying them of the 
change. However, since it is more common to change 
routes, we can reduce the number of updates that need to 
be sent with a modification to iBGP where updates are 
sent when the migrate-to router keeps a route instead of 
when it changes a route. Other routers will be notified of 
the migration and will assume the routes being migrated 
will be selected unless told otherwise. 
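
The trade-off between the two notification conventions can be made concrete with a simplified count, sketched in Python. The model is an assumption made for illustration: it counts one update per affected prefix at the migrate-to router, considers only prefixes where the two routers had selected different routes, and ignores fan-out to multiple iBGP neighbors.

```python
def ibgp_update_counts(differing_prefixes, change_fraction):
    """Count updates under each convention.

    differing_prefixes: prefixes where the two routers selected different routes
    change_fraction:    share of those where the migrate-to router changes
                        its selection after grafting
    """
    changed = round(differing_prefixes * change_fraction)
    kept = differing_prefixes - changed
    return {
        "notify_on_change": changed,   # unmodified iBGP behavior
        "notify_on_keep": kept,        # modified convention described above
    }


mostly_change = ibgp_update_counts(1000, 0.8)
mostly_keep = ibgp_update_counts(1000, 0.2)
```

When most differing routes change, notifying only on kept routes sends fewer updates; when most are kept, the unmodified convention is cheaper.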


6.3 Eliminating Processing Entirely 


Re-running the route-selection processes is essential as 
migration can change the topology, and therefore change 
the best route. When migrating within a cluster router, 
the topology does not change, and therefore we should 
be able to eliminate processing entirely. The selected 
best route will be a consistent selection on every blade. 
Therefore, even when migrating, while the internal data 
structures might need to be adjusted, no decision pro- 
cess needs to be run and no external messages need to be 
sent. In fact, there is no need for any internal messages 
to be sent either. With the modified iBGP used for com- 
munication between route processor blades, the next hop 
field is the next router, not the next processor blade; i.e., 
iBGP messages are only used to exchange routes learned 
externally and do not affect how packets are forwarded 
internally. Therefore, upon migration, there is no need 
to send an update as the routes learned externally have 
already been exchanged. 

While exchanging messages and running the decision 
process can be eliminated, transferring the routing state 
from the exporting blade to the importing blade is still 
needed. Being the blade responsible for a particular BGP 
session requires that the local RIB have all of the routes 
learned over that session. While some may have been 
previously announced by the migrate-from blade, not all 
of them were. Therefore, we need to send over the Adj- 
RIB-in for the migrating session in order to know all 
routes learned over that session as well as which subset 
of routes the migrate-from blade announced were asso- 
ciated with that session. 


7 Performance Evaluation 


In this section, we evaluate router grafting through exper- 
iments with our prototype system and realistic traces of 
BGP update messages. We focus primarily on control- 
plane overhead, since data-plane performance depends 
primarily on the latency for link migration—where our 
solution simply leverages recent innovations in pro- 
grammable transport networks. First, we evaluate our 
prototype implementation from Section 5 to measure the 
grafting time and CPU utilization on the migrate-from 
and migrate-to routers. Then we evaluate the effective- 
ness of our optimizations from Section 6 in reducing the 
number of update messages received by other routers. 


7.1 Grafting Delay and Overhead 


The first experiment measures the impact of BGP ses- 
sion grafting on the migrate-from and migrate-to routers. 
To do this we supplemented the topology shown in Fig- 
ure 5 with a router adjacent to E (in a different AS) and a 
router adjacent to B (in a different AS). These two extra 
routers were fed a BGP update message trace taken from 
Route Views [14]. This essentially fills the RIB of B and 
E with routes that have the same set of prefixes, but dif- 
ferent paths. We used Emulab [15] to run the experiment 
on servers with 3GHz processors and 2GB RAM.4 

The time it takes to complete the migration process is 
a function of the size of the routing table. The larger it 
is, the larger the state that needs to be transferred and 
the more routes that need to be compared. To capture 
this relationship, we varied the RIB size by replaying 
multiple traces. The results, shown in Figure 7, include 
both the case where migration occurs between routers 
(when the migrate-to router must run the BGP decision 
process) and the case where migration is between blades 
(where the decision process does not need to run because 
the underlying topology is not changed). The “between 
blades” curve, then, illustrates the time required to trans- 
fer the BGP routes and import them into the internal data 
structures. Note that these results do not imply that TCP 
needs to be able to handle an outage this long, during which 
packets go unacknowledged: the TCP migration process itself 
takes less than a millisecond. Instead, when compared 
to rehoming a customer today, where there is downtime 
measured in minutes, the migration time is small. In 
fact, since in our setup AS100 and AS200 have a peer- 
ing agreement, the actual migration time would be less if 
AS100 were a customer of AS200 (since AS100 would 
announce fewer routes to AS200). 


4This is roughly comparable to the route processors used in com- 
mercially available high-end routers. 

Figure 7: BGP session grafting time vs. RIB size. 

The CPU utilization during the grafting process is also 
important. The BGP process on the migrate-from router 
experienced only a negligible increase in CPU utiliza- 
tion for dumping the BGP RIBs. The migrate-to router 
needs to import the routing entries and compare routing 
tables. For each prefix in the received routing informa- 
tion, the migrate-to router must perform a lookup to find 
the routing table entry for that prefix. Figure 8 shows 
the CPU utilization at 0.2 second intervals, as reported 
by top, for the case where the RIB consists of 200,000 
prefixes. There are three things to note. First, the CPU 
utilization is roughly constant. This is perhaps due to the 
implementation where the data is received, placed in a 
file, then iteratively read from the file and processed be- 
fore reading the next. This keeps the CPU utilization 
at only a fraction as computation is mixed with reads 
from disk. Second, the CPU utilization is the same for 
both migrating between routers and migrating between 
blades. The case between routers merely takes longer be- 
cause of the additional work involved in running the BGP 
decision process. Third, migration can be run as a 
lower-priority task that uses less CPU but takes longer, 
preventing the migration from affecting the performance 
of the router during spikes in routing updates, which 
commonly cause intense CPU usage. 


7.2 Optimizations for Reducing Impact 


While the impact on the migrate-from and migrate-to 
routers is important, perhaps a more important metric is 
the impact on the routers not involved in the migration, 
including other routers within the same AS as well as the 
eBGP neighbors. If the overhead of grafting is relatively 
contained, network operators could more freely apply the 
technique to simplify network-management tasks. 

Figure 8: The CPU utilization at the migrate-to router 
during migration, with a 200k prefix RIB. 

First and foremost, the remote end-point experiences 
an overhead directly proportional to the number of ad- 
ditional BGP update messages it receives. The num- 
ber of messages depends on how many best routes dif- 
fer between the migrate-from and migrate-to router—the 
migrate-from router must send an update message for ev- 
ery route that differs. The exact amount depends heav- 
ily on the proximity of the migrate-from and migrate- 
to routers—if the two routers are in the same Point-of- 
Presence of the ISP, perhaps no routes would change. 
As such, we do not expect this overhead to be signifi- 
cant. Since the sources of overhead for the remote end- 
point are relatively well understood, and it is difficult to 
acquire the kinds of intra-ISP measurement data neces- 
sary to quantify the number of route changes, we do not 
present a plot for this case. 

Perhaps the more significant impact is on the other 
routers, both within the AS and in other ASes, that may 
have to learn new routes for the prefixes announced by 
the remote end-point. To evaluate this, we measured the 
number of updates that would be sent as a function of the 
fraction of prefixes where the migrate-from router had 
selected a different route than the migrate-to router. By 
doing so, this covers the entire range of migration targets 
(i.e., it does not limit our evaluation to migration within 
a PoP). Recall that this difference is what needs to be 
corrected. Also recall that the prefixes being considered 
here are the ones learned from the router at the remote 
end-point of the session being migrated, not the entire 
routing table, as these are the routes that could impact 
what is sent to other routers. For our measurement, we 
use a fixed set of 100,000 prefixes. However, the results 
are directly proportional to the number of prefixes, and 
can therefore be scaled appropriately: for migrating a 
customer link, the number of prefixes would be signifi- 
cantly smaller, while for migrating a peering link, the number 
of prefixes could be higher. 

Figure 9: Updates sent as a result of migration. 


The results are shown in Figure 9, with the three 
graphs representing the three different cases as discussed 
in Section 6: (a) direct approach with no optimizations, 
(b) optimizations to reduce eBGP messages by capital- 
izing on redundant information in the network, and (c) 
optimizations to reduce iBGP messages by treating the 
route selection changing as the common case. For the 
graphs, each line represents a fixed fraction of differ- 
ing routes that change the selected route as a result of 
the grafting. For example, consider where the migrate- 
from router selects a particular route different than the 
migrate-to router. In this case, after migration, the 
migrate-to router selects the route the migrate-from router se- 
lected (i.e., it changes its own route). Each line repre- 
sents the fraction of times this change occurs; for ex- 
ample, the line labeled 0.2 in Figure 9 is where 20% of 
the routes that differ will change to the routes selected by 
the migrate-from router. 


There are several things of note from the graphs. First 
is that the direct (unoptimized) approach must send sig- 
nificantly more messages. In the case where the selected 
routes do not differ much, which we consider the 
most likely scenario, the optimized approaches hardly 
send any messages at all. Second, when comparing Fig- 
ure 9(b) with Figure 9(c), we can see that depending 
on what would be considered the common case, we can 
choose a method that would result in the fewest updates. 
For (b), the assumption is that when the routes differ, 
the migrate-to router will not change to the routes the 
migrate-from selected. Whereas in (c), the assumption is 
that when the routes differ, the migrate-to router will 
change to the routes the migrate-from router selected. 
The reason they would change is that the routes learned 
from the remote end-point of the session being migrated 
will now be directly learned routes, rather than via iBGP. 
It is likely that the policy of route selection is consistent 
throughout the ISP’s network, and therefore differences 
will be due to IGP distances and changing the router will 
change those routes to be more preferable. We are work- 
ing on characterizing when these differences would oc- 
cur in order to enable us to predict the impact a given mi- 
gration might have. Third, and perhaps most important, 
migration can be performed with minimal disruption to 
other routers in the likely scenario where there are few 
differences in routes selected. 


8 Related Work 


High availability and ease of network management are 
goals of many systems, and therefore router grafting has 
much in common with them, in particular with those that 
attempt to minimize disruptions during planned main- 
tenance. One possibility is to reconfigure the routing 
protocols such that traffic will no longer be sent to the 
router about to undergo maintenance [16, 17]. Alter- 
natively, others have decoupled the control plane and 
data plane such that the router can continue to forward 
packets while the control plane goes off-line (e.g., re- 
booted) [18, 19]. However, unlike router grafting, these 
require modifications to the remote end-point router and 
they are only useful for temporarily shutting down the 
session on a given physical router, rather than enabling 
the session to come back up on a different router as in 
router grafting. 

In this regard, router grafting shares more in common 
with VROOM [3], which makes use of virtual machine 
migration [20] to ease network management. Mainte- 
nance could be performed without taking down the router 
simply by migrating the virtual router to another phys- 
ical router. This requires the two physical routers to 
be compatible (running the same virtualization technol- 
ogy), a limitation router grafting does not have. In fact, 
router grafting does not rely on virtual machine technol- 
ogy. Kozuch showed the ability to migrate without the 
use of virtualization [21], but did so at the granularity 
of the entire operating system and all running processes. 
With a coarse granularity, the physical router where the 
virtual router is being migrated to must be able to handle 
the entire virtual router’s load. 

Router grafting is also similar to the RouterFarm 
work [6], which targeted re-homing a customer. How- 
ever, it required restarting the session and is more dis- 
ruptive than router grafting. Along similar lines, high- 
availability routers enable switching over to a different 
router or blade in a router [22]. This, however, is done 
either through periodically check-pointing, which pre- 
serves the memory image, or running two complete in- 
stances of the router software concurrently, which is an 
inefficient use of resources. 

While we presented router grafting in the context of a 
BGP session, we envision it being more general. Along 
these lines, partitioning the prefix space across multiple 
routers or blades is a possibility. ViAggre [23] partitions 
the prefix space across multiple routers; however, it is 
a static architecture, not one that dynamically reparti- 
tions the prefix space as router grafting could. 

Finally, we made use of TCP socket migration to han- 
dle change or disruption in end-points. One alternative 
is to modify the TCP protocol to include the ability to 
change IP addresses [24]. Since the IP address of the 
end-points in router grafting can remain the same, we do 
not need this capability, but could make use of it. 


9 Conclusions 


Router grafting is a new technique that opens many new 
possibilities for managing a network. It does this by en- 
abling, without disruption, the migration of a routing ses- 
sion between (i) physical routers, (ii) blades in a cluster 
router, and (iii) routers from different vendors. We were 
able to do this while being transparent to the remote end- 
point. We handled the changes in topology through in- 
cremental updates, only sending out the necessary up- 
dates to convey the difference. Importantly, we did not 
affect the correctness of the network as the data plane 
will continue to forward packets and routing updates do 
not cause the migration to be aborted. 

Going forward, we plan to explore the motivating ap- 
plications for router grafting to further demonstrate the 
usefulness of our new technique. We are particularly in- 
terested in exploring the role of router grafting in traffic 
engineering. Finally, this work raises interesting ques- 
tions about what exactly a router is, and the various ways 
routers can be “sliced and diced.” We plan to explore 
these questions in our ongoing work. 


ABSTRACT 


Networks are a shared resource connecting critical IT in- 
frastructure, and the general practice is to always leave 
them on. Yet, meaningful energy savings can result from 
improving a network’s ability to scale up and down, as 
traffic demands ebb and flow. We present ElasticTree, a 
network-wide power¹ manager, which dynamically ad- 
justs the set of active network elements (links and 
switches) to satisfy changing data center traffic loads. 

We first compare multiple strategies for finding 
minimum-power network subsets across a range of traf- 
fic patterns. We implement and analyze ElasticTree 
on a prototype testbed built with production OpenFlow 
switches from three network vendors. Further, we ex- 
amine the trade-offs between energy efficiency, perfor- 
mance and robustness, with real traces from a produc- 
tion e-commerce website. Our results demonstrate that 
for data center workloads, ElasticTree can save up to 
50% of network energy, while maintaining the ability to 
handle traffic surges. Our fast heuristic for computing 
network subsets enables ElasticTree to scale to data cen- 
ters containing thousands of nodes. We finish by show- 
ing how a network admin might configure ElasticTree to 
satisfy their needs for performance and fault tolerance, 
while minimizing their network power bill. 


1. INTRODUCTION 


Data centers aim to provide reliable and scalable 
computing infrastructure for massive Internet services.
To achieve these properties, they consume
huge amounts of energy, and the resulting opera- 
tional costs have spurred interest in improving their 
efficiency. Most efforts have focused on servers and 
cooling, which account for about 70% of a data cen- 
ter’s total power budget. Improvements include bet- 
ter components (low-power CPUs [12], more effi- 
cient power supplies and water-cooling) as well as 
better software (tickless kernel, virtualization, and 
smart cooling [30]).

With energy management schemes for the largest 
power consumers well in place, we turn to a part of 
the data center that consumes 10-20% of its total 


¹We use power and energy interchangeably in this paper.


power: the network [9]. The total power consumed 
by networking elements in data centers in 2006 in 
the U.S. alone was 3 billion kWh and rising [7]; our 
goal is to significantly reduce this rapidly growing 
energy cost. 


1.1 Data Center Networks 


As services scale beyond ten thousand servers, 
inflexibility and insufficient bisection bandwidth 
have prompted researchers to explore alternatives 
to the traditional 2N tree topology (shown in Figure
1(a)) [1] with designs such as VL2 [10], PortLand [24],
DCell [16], and BCube [15]. The resulting networks
look more like a mesh than a tree.
One such example, the fat tree [1]², seen in Figure
1(b), is built from a large number of richly connected 
switches, and can support any communication pat- 
tern (i.e. full bisection bandwidth). Traffic from 
lower layers is spread across the core, using multi- 
path routing, valiant load balancing, or a number of 
other techniques. 

In a 2N tree, one failure can cut the effective bi- 
section bandwidth in half, while two failures can dis- 
connect servers. Richer, mesh-like topologies handle 
failures more gracefully; with more components and 
more paths, the effect of any individual component 
failure becomes manageable. This property can also 
help improve energy efficiency. In fact, dynamically 
varying the number of active (powered on) network 
elements provides a control knob to tune between 
energy efficiency, performance, and fault tolerance, 
which we explore in the rest of this paper. 


1.2 Inside a Data Center


Data centers are typically provisioned for peak 
workload, and run well below capacity most of the 
time. Traffic varies daily (e.g., email checking during 
the day), weekly (e.g., enterprise database queries 
on weekdays), monthly (e.g., photo sharing on holi- 
days), and yearly (e.g., more shopping in December). 
Rare events like cable cuts or celebrity news may hit 
the peak capacity, but most of the time traffic can 
be satisfied by a subset of the network links and 


²Essentially a buffered Clos topology.
Figure 1: Data Center Networks: (a) 2N Tree, (b) Fat Tree, (c) ElasticTree
Figure 2: E-commerce website: 292 production
web servers over 5 days. Traffic varies
by day/weekend, power doesn’t.


switches. These observations are based on traces
collected from two production data centers. 

Trace 1 (Figure 2) shows aggregate traffic col- 
lected from 292 servers hosting an e-commerce ap- 
plication over a 5 day period in April 2008 [22]. A 
clear diurnal pattern emerges; traffic peaks during 
the day and falls at night. Even though the traffic 
varies significantly with time, the rack and aggre- 
gation switches associated with these servers draw 
constant power (secondary axis in Figure 2). 

Trace 2 (Figure 3) shows input and output traffic 
at a router port in a production Google data center 
in September 2009. The Y axis is in Mbps. The 8- 
day trace shows diurnal and weekend/weekday vari- 
ation, along with a constant amount of background 
traffic. The 1-day trace highlights more short-term 
bursts. Here, as in the previous case, the power 
consumed by the router is fixed, irrespective of the 
traffic through it. 


1.3 Energy Proportionality 


An earlier power measurement study [22] had pre- 
sented power consumption numbers for several data 
center switches for a variety of traffic patterns and 
Figure 3: Google Production Data Center


switch configurations. We use switch power mea- 
surements from this study and summarize relevant 
results in Table 1. In all cases, turning the switch on 
consumes most of the power; going from zero to full 
traffic increases power by less than 8%. Turning off a 
switch yields the most power benefits, while turning 
off an unused port saves only 1-2 Watts. Ideally, an 
unused switch would consume no power, and energy 
usage would grow with increasing traffic load. Con- 
suming energy in proportion to the load is a highly 
desirable behavior [4, 22]. 
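As a quick check of the "less than 8%" observation, the sketch below recomputes the traffic-related power increase from the Table 1 measurements for Models A and C. The dictionary layout and function names are illustrative assumptions, and Model B is left out of this sketch.

```python
# Power draw in watts, copied from Table 1 (48-port switches).
# Keys are hypothetical labels for the three measured configurations.
TABLE_1 = {
    "A": {"ports_off": 151, "ports_on_idle": 184, "ports_on_1gbps": 195},
    "C": {"ports_off": 76, "ports_on_idle": 97, "ports_on_1gbps": 102},
}


def traffic_overhead_pct(model):
    """Percent power increase from zero traffic to 1 Gbps on every port."""
    row = TABLE_1[model]
    delta = row["ports_on_1gbps"] - row["ports_on_idle"]
    return 100.0 * delta / row["ports_on_idle"]


def idle_share_pct(model):
    """Share of full-load power drawn with all ports enabled but idle."""
    row = TABLE_1[model]
    return 100.0 * row["ports_on_idle"] / row["ports_on_1gbps"]


for model in TABLE_1:
    print(model,
          round(traffic_overhead_pct(model), 1),
          round(idle_share_pct(model), 1))
```

Both models stay well under the 8% bound, which is why switching off whole switches matters far more than shaping traffic on switches that remain on.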

Unfortunately, today’s network elements are not 
energy proportional: fixed overheads such as fans, 
switch chips, and transceivers waste power at low 
loads. The situation is improving, as competition
encourages more efficient products, such as closer- 
to-energy-proportional links and switches [19, 18, 
26, 14]. However, maximum efficiency comes from a 
Ports     Port      Model A     Model B       Model C
Enabled   Traffic   power (W)   power (W)     power (W)
None      None      151         133           76
All       None      184         170           97
All       1 Gbps    195         [illegible]   102


Table 1: Power consumption of various 48- 
port switches for different configurations 


combination of improved components and improved 
component management. 

Our choice — as presented in this paper — is to 
manage today’s non energy-proportional network 
components more intelligently. By zooming out to 
a whole-data-center view, a network of on-or-off, 
non-proportional components can act as an energy- 
proportional ensemble, and adapt to varying traffic 
loads. The strategy is simple: turn off the links and 
switches that we don’t need, right now, to keep avail- 
able only as much networking capacity as required. 


1.4 Our Approach 


ElasticTree is a network-wide energy optimizer 
that continuously monitors data center traffic con- 
ditions. It chooses the set of network elements that 
must stay active to meet performance and fault tol- 
erance goals; then it powers down as many unneeded 
links and switches as possible. We use a variety of 
methods to decide which subset of links and switches 
to use, including a formal model, greedy bin-packer, 
topology-aware heuristic, and prediction methods. 
We evaluate ElasticTree by using it to control the 
network of a purpose-built cluster of computers and 
switches designed to represent a data center. Note 
that our approach applies to currently-deployed net- 
work devices, as well as newer, more energy-efficient 
ones. It applies to single forwarding boxes in a net- 
work, as well as individual switch chips within a 
large chassis-based router. 

While the energy savings from powering off an 
individual switch might seem insignificant, a large 
data center hosting hundreds of thousands of servers 
will have tens of thousands of switches deployed. 
The energy savings depend on the traffic patterns, 
the level of desired system redundancy, and the size 
of the data center itself. Our experiments show that, 
on average, savings of 25-40% of the network energy
in data centers are feasible. Extrapolating to all
data centers in the U.S., we estimate the savings to
be about 1 billion kWh annually (based on 3 billion
kWh used by networking devices in U.S. data
centers [7]). Additionally, reducing the energy con- 
sumed by networking devices also results in a pro- 
portional reduction in cooling costs. 
Figure 4: System Diagram
The remainder of the paper is organized as follows:
§2 describes the ElasticTree approach in more detail,
plus the modules used to build the prototype.
§3 computes the power savings possible for different
communication patterns to understand best- and worst-case scenarios.
We also explore power
savings using real data center traffic traces. In §4,
we measure the potential impact on bandwidth and
latency due to ElasticTree. In §5, we explore deployment
aspects of ElasticTree in a real data center.
We present related work in §6 and discuss lessons
learned in §7.


2. ELASTICTREE 


ElasticTree is a system for dynamically adapting 
the energy consumption of a data center network. 
ElasticTree consists of three logical modules - opti- 
mizer, routing, and power control - as shown in Fig- 
ure 4. The optimizer’s role is to find the minimum- 
power network subset which satisfies current traffic 
conditions. Its inputs are the topology, traffic ma- 
trix, a power model for each switch, and the desired 
fault tolerance properties (spare switches and spare 
capacity). The optimizer outputs a set of active 
components to both the power control and routing 
modules. Power control toggles the power states of 
ports, linecards, and entire switches, while routing 
chooses paths for all flows, then pushes routes into 
the network. 

We now show an example of the system in action. 


2.1 Example 


Figure 1(c) shows a worst-case pattern for network 
locality, where each host sends one data flow halfway 
across the data center. In this example, 0.2 Gbps 
of traffic per host must traverse the network core. 
When the optimizer sees this traffic pattern, it finds 
which subset of the network is sufficient to satisfy 
the traffic matrix. In fact, a minimum spanning tree 
(MST) is sufficient, and leaves 0.2 Gbps of extra 
capacity along each core link. The optimizer then
informs the routing module to compress traffic along
the new sub-topology, and finally informs the power 
control module to turn off unneeded switches and 
links. We assume a 3:1 idle:active ratio for modeling 
switch power consumption; that is, 3W of power to 
have a switch port, and 1 W extra to turn it on, based 
on the 48-port switch measurements shown in Table 
1. In this example, 13/20 switches and 28/48 links 
stay active, and ElasticTree reduces network power
by 38%. 
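The arithmetic behind this figure can be sketched as follows. The accounting is an assumption on our part: every port on a powered switch costs 3 W, every port attached to an active link costs 1 W more, a host link occupies one switch port, and a switch-to-switch link occupies two. The split of the 28 active links into 16 host links and 12 switch-to-switch links is inferred from the spanning-tree structure of a k = 4 fat tree, not stated in the text.

```python
IDLE_W_PER_PORT = 3.0    # power to have a port on a powered switch (from the text)
ACTIVE_W_PER_PORT = 1.0  # extra power to turn that port on (from the text)
PORTS_PER_SWITCH = 4     # k = 4 fat tree


def network_power(switches_on, host_links_on, switch_links_on):
    """Total network power in watts under the assumed accounting."""
    idle = switches_on * PORTS_PER_SWITCH * IDLE_W_PER_PORT
    # A host link uses one switch port; a switch-to-switch link uses two.
    active_ports = host_links_on + 2 * switch_links_on
    return idle + active_ports * ACTIVE_W_PER_PORT


# Full k = 4 fat tree: 20 switches, 16 host links, 32 switch-to-switch links.
full = network_power(20, 16, 32)
# Spanning-tree subset: 13 switches, 16 host links, 12 switch-to-switch links.
subset = network_power(13, 16, 12)

reduction_pct = 100.0 * (1.0 - subset / full)
print(full, subset, reduction_pct)
```

Under these assumptions the reduction comes out just under 39%, in line with the 38% quoted above.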

As traffic conditions change, the optimizer con- 
tinuously recomputes the optimal network subset. 
As traffic increases, more capacity is brought online, 
until the full network capacity is reached. As traffic 
decreases, switches and links are turned off. Note 
that when traffic is increasing, the system must wait 
for capacity to come online before routing through 
that capacity. In the other direction, when traffic 
is decreasing, the system must change the routing 
- by moving flows off of soon-to-be-down links and 
switches - before power control can shut anything 
down. 
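The ordering constraint described above can be sketched as a small planning function. The step names and the set-based interface are hypothetical and only illustrate the sequencing; they are not the prototype's API.

```python
def plan_transition(current_on, target_on):
    """Order the steps needed to move from one active set to another.

    Both arguments are sets of element names (switches or links).
    New capacity is powered on and confirmed up before any flow uses it;
    elements are powered off only after flows have been moved away.
    """
    to_power_on = sorted(target_on - current_on)
    to_power_off = sorted(current_on - target_on)

    steps = []
    if to_power_on:
        steps.append(("power_on", to_power_on))
        steps.append(("wait_until_up", to_power_on))
    # Routing is always recomputed against the target subset.
    steps.append(("reroute", sorted(target_on)))
    if to_power_off:
        steps.append(("power_off", to_power_off))
    return steps


# Traffic increasing: bring agg2 online before routing through it.
print(plan_transition({"agg1"}, {"agg1", "agg2"}))
# Traffic decreasing: move flows off agg2 before shutting it down.
print(plan_transition({"agg1", "agg2"}, {"agg1"}))
```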

Of course, this example goes too far in the direc- 
tion of power efficiency. The MST solution leaves the 
network prone to disconnection from a single failed 
link or switch, and provides little extra capacity to 
absorb additional traffic. Furthermore, a network 
operated close to its capacity will increase the chance 
of dropped and/or delayed packets. Later sections 
explore the tradeoffs between power, fault tolerance, 
and performance. Simple modifications can dramat- 
ically improve fault tolerance and performance at 
low power, especially for larger networks. We now 
describe each of ElasticTree's modules in detail.


2.2 Optimizers 


We have developed a range of methods to compute
a minimum-power network subset in ElasticTree,
as summarized in Table 2. The first method is
a formal model, mainly used to evaluate the solution 
quality of other optimizers, due to heavy computa- 
tional requirements. The second method is greedy 
bin-packing, useful for understanding power savings 
for larger topologies. The third method is a simple 
heuristic to quickly find subsets in networks with 
regular structure. Each method achieves different 
tradeoffs between scalability and optimality. All 
methods can be improved by considering a data center’s
past traffic history (details in §5.4).


2.2.1 Formal Model 


We desire the optimal-power solution (subset and 
flow assignment) that satisfies the traffic constraints, 


³Bounded percentage from optimal, configured to 10%.
Type         Quality    Scalability   Input            Topo
Formal       Optimal³   Low           Traffic Matrix   Any
Greedy       Good       Medium        Traffic Matrix   Any
Topo-aware   OK         High          Port Counters    Fat Tree

Table 2: Optimizer Comparison 


but finding the optimal flow assignment alone is an 
NP-complete problem for integer flows. Despite this 
computational complexity, the formal model pro- 
vides a valuable tool for understanding the solution 
quality of other optimizers. It is flexible enough to 
support arbitrary topologies, but can only scale up 
to networks with fewer than 1000 nodes.

The model starts with a standard multi- 
commodity flow (MCF) problem. For the precise 
MCF formulation, see Appendix A. The constraints 
include link capacity, flow conservation, and demand 
satisfaction. The variables are the flows along each 
link. The inputs include the topology, switch power 
model, and traffic matrix. To optimize for power, we 
add binary variables for every link and switch, and 
constrain traffic to only active (powered on) links 
and switches. The model also ensures that the full
power cost for an Ethernet link is incurred when ei- 
ther side is transmitting; there is no such thing as a 
half-on Ethernet link. 

The optimization goal is to minimize the total net- 
work power, while satisfying all constraints. Split- 
ting a single flow across multiple links in the topol- 
ogy might reduce power by improving link utilization 
overall, but reordered packets at the destination (re- 
sulting from varying path delays) will negatively im- 
pact TCP performance. Therefore, we include con- 
straints in our formulation to (optionally) prevent 
flows from getting split. 
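For orientation, the structure described in the last three paragraphs can be written as the following mixed-integer program. This is a sketch in our own hypothetical notation, not a copy of the formulation in Appendix A: f_i(u,v) is the rate of flow i on link (u,v), d_i its demand from s_i to t_i, c(u,v) the link capacity, a(u,v) and b(u) the link and switch power costs, and X_{u,v}, Y_u the binary on/off variables.

```latex
\begin{align*}
\text{minimize}\quad
  & \sum_{(u,v)\in E} a(u,v)\,X_{u,v} \;+\; \sum_{u\in V} b(u)\,Y_u \\
\text{subject to}\quad
  & \sum_{i} f_i(u,v) \le c(u,v)\,X_{u,v}
      && \forall (u,v)\in E
      && \text{(capacity, active links only)} \\
  & \sum_{w} f_i(u,w) - \sum_{w} f_i(w,u) =
      \begin{cases} d_i & u = s_i \\ -d_i & u = t_i \\ 0 & \text{otherwise} \end{cases}
      && \forall i,\ \forall u\in V
      && \text{(conservation and demand)} \\
  & X_{u,v} = X_{v,u}
      && \forall (u,v)\in E
      && \text{(no half-on link)} \\
  & X_{u,v} \le Y_u,\quad X_{u,v} \le Y_v
      && \forall (u,v)\in E
      && \text{(a link needs both switches on)} \\
  & X_{u,v},\, Y_u \in \{0,1\}
\end{align*}
```

The optional no-split constraint can be expressed by adding a binary variable r_i(u,v) per flow and link and requiring f_i(u,v) = d_i r_i(u,v), so each flow either uses a link at its full rate or not at all.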

The model outputs a subset of the original topol- 
ogy, plus the routes taken by each flow to satisfy 
the traffic matrix. Our model shares similar goals to
Chabarek et al. [6], which also looked at power-aware
routing. However, our model (1) focuses on data 
centers, not wide-area networks, (2) chooses a sub- 
set of a fixed topology, not the component (switch) 
configurations in a topology, and (3) considers indi- 
vidual flows, rather than aggregate traffic. 

We implement our formal method using both 
MathProg and General Algebraic Modeling System 
(GAMS), which are high-level languages for opti- 
mization modeling. We use both the GNU Linear 
Programming Kit (GLPK) and CPLEX to solve the 
formulation. 
2.2.2 Greedy Bin-Packing


For even simple traffic patterns, the formal 
model’s solution time scales to the 3.5th power as a
function of the number of hosts (details in §5). The
greedy bin-packing heuristic improves on the formal
model’s scalability. Solutions within a bound of opti- 
mal are not guaranteed, but in practice, high-quality 
subsets result. For each flow, the greedy bin-packer 
evaluates possible paths and chooses the leftmost 
one with sufficient capacity. By leftmost, we mean 
in reference to a single layer in a structured topol- 
ogy, such as a fat tree. Within a layer, paths are 
chosen in a deterministic left-to-right order, as op- 
posed to a random order, which would evenly spread 
flows. When all flows have been assigned (which is 
not guaranteed), the algorithm returns the active 
network subset (set of switches and links traversed 
by some flow) plus each flow path. 

For some traffic matrices, the greedy approach will 
not find a satisfying assignment for all flows; this 
is an inherent problem with any greedy flow assign- 
ment strategy, even when the network is provisioned 
for full bisection bandwidth. In this case, the greedy 
search will have enumerated all possible paths, and 
the flow will be assigned to the path with the lowest 
load. Like the model, this approach requires knowl- 
edge of the traffic matrix, but the solution can be 
computed incrementally, possibly to support on-line 
usage. 
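A minimal sketch of the leftmost-fit rule follows. It assumes candidate paths are supplied already sorted left to right, that every link has the same capacity, and that the "load" of a path in the fallback case is the load of its busiest link. All three are simplifications of ours; the function and variable names are hypothetical.

```python
def greedy_assign(flows, candidate_paths, capacity):
    """Assign each flow to the leftmost path with enough spare capacity.

    flows: list of (flow_id, demand) pairs, processed in order.
    candidate_paths: flow_id -> list of paths in left-to-right order,
        where each path is a tuple of link names.
    capacity: capacity of every link, in the same unit as the demands.
    Returns (routes, active_links).
    """
    load = {}    # link name -> traffic already assigned
    routes = {}  # flow_id -> chosen path

    for flow_id, demand in flows:
        paths = candidate_paths[flow_id]
        chosen = None
        for path in paths:  # deterministic left-to-right scan
            if all(load.get(link, 0.0) + demand <= capacity for link in path):
                chosen = path
                break
        if chosen is None:
            # No path fits: fall back to the least-loaded path.
            chosen = min(paths,
                         key=lambda p: max(load.get(link, 0.0) for link in p))
        for link in chosen:
            load[link] = load.get(link, 0.0) + demand
        routes[flow_id] = chosen

    active_links = {link for path in routes.values() for link in path}
    return routes, active_links


paths = {f: [("left",), ("right",)] for f in ("f1", "f2", "f3")}
routes, active = greedy_assign([("f1", 0.6), ("f2", 0.3), ("f3", 0.5)], paths, 1.0)
print(routes, active)
```

Because flows are packed onto the leftmost links first, links further right stay empty for as long as possible and can be switched off.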


2.2.3 Topology-aware Heuristic 


The last method leverages the regularity of the fat 
tree topology to quickly find network subsets. Unlike 
the other methods, it does not compute the set of 
flow routes, and assumes perfectly divisible flows. Of 
course, by splitting flows, it will pack every link to 
full utilization and reduce TCP bandwidth — not 
exactly practical. 

However, simple additions to this “starter sub- 
set” lead to solutions of comparable quality to other 
methods, but computed with less information, and 
in a fraction of the time. In addition, by decoupling 
power optimization from routing, our method can 
be applied alongside any fat tree routing algorithm, 
including OSPF-ECMP, valiant load balancing [10], 
flow classification [1] [2], and end-host path selec- 
tion [23]. Computing this subset requires only port 
counters, not a full traffic matrix. 

The intuition behind our heuristic is that to satisfy 
traffic demands, an edge switch doesn’t care which 
aggregation switches are active, but instead, how 
many are active. The “view” of every edge switch in 
a given pod is identical; all see the same number of 
aggregation switches above. The number of required 


switches in the aggregation layer is then equal to the 
number of links required to support the traffic of 
the most active source above or below (whichever is 
higher), assuming flows are perfectly divisible. For 
example, if the most active source sends 2 Gbps of 
traffic up to the aggregation layer and each link is 
1 Gbps, then two aggregation layer switches must 
stay on to satisfy that demand. A similar observa- 
tion holds between each pod and the core, and the 
exact subset computation is described in more detail 
in §5. One can think of the topology-aware heuristic
as a cron job for that network, providing periodic 
input to any fat tree routing algorithm. 
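The aggregation-layer rule can be sketched in a few lines. We read "most active source" as the busiest edge switch in the pod, take per-edge-switch uplink and downlink rates as input, and keep at least one aggregation switch on for connectivity. These are our assumptions; the exact computation is the one given in §5.

```python
import math


def aggregation_switches_needed(edge_up_gbps, edge_down_gbps, link_gbps=1.0):
    """Number of aggregation switches a pod must keep powered on.

    edge_up_gbps / edge_down_gbps: traffic each edge switch in the pod
    sends up to, or receives down from, the aggregation layer.
    Flows are assumed perfectly divisible across links.
    """
    peak = max(max(edge_up_gbps), max(edge_down_gbps))
    links_needed = math.ceil(peak / link_gbps)
    return max(1, links_needed)  # assumption: never drop below one switch


# The example from the text: 2 Gbps up from the busiest edge switch, 1 Gbps links.
print(aggregation_switches_needed([2.0, 0.3], [0.5, 0.1]))
```

Only port counters are needed to fill in the two input lists, which matches the claim that this method does not require a full traffic matrix.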

For simplicity, our computations assume a homo- 
geneous fat tree with one link between every con- 
nected pair of switches. However, this technique 
applies to full-bisection-bandwidth topologies with 
any number of layers (we show only 3 stages), bun- 
dled links (parallel links connecting two switches), 
or varying speeds. Extra “switches at a given layer” 
computations must be added for topologies with 
more layers. Bundled links can be considered single
faster links. The same computation works for
other topologies, such as the aggregated Clos used 
by VL2 [10], which has 10G links above the edge 
layer and 1G links to each host. 

We have implemented all three optimizers; each 
outputs a network topology subset, which is then 
used by the control software. 


2.3 Control Software 


ElasticTree requires two network capabilities: 
traffic data (current network utilization) and control 
over flow paths. NetFlow [27], SNMP and sampling 
can provide traffic data, while policy-based rout- 
ing can provide path control, to some extent. In 
our ElasticTree prototype, we use OpenFlow [29] to
achieve the above tasks. 

OpenFlow: OpenFlow is an open API added 
to commercial switches and routers that provides a 
flow table abstraction. We first use OpenFlow to 
validate optimizer solutions by directly pushing the 
computed set of application-level flow routes to each 
switch, then generating traffic as described later in 
this section. In the live prototype, OpenFlow also 
provides the traffic matrix (flow-specific counters), 
port counters, and port power control. OpenFlow 
enables us to evaluate ElasticTree on switches from 
different vendors, with no source code changes. 

NOX: NOX is a centralized platform that pro- 
vides network visibility and control atop a network 
of OpenFlow switches [13]. The logical modules 
in ElasticTree are implemented as a NOX application.
The application pulls flow and port counters,
Figure 5: Hardware Testbed (HP switch for
k = 6 fat tree) 


Vendor   Model    k   Virtual Switches   Ports   Hosts
HP       5400     6   45                 270     54
Quanta   LB4G     4   20                 80      16
NEC      IP8800   4   20                 80      16


Table 3: Fat Tree Configurations 


directs these to an optimizer, and then adjusts flow 
routes and port status based on the computed sub- 
set. In our current setup, we do not power off in- 
active switches, due to the fact that our switches 
are virtual switches. However, in a real data cen- 
ter deployment, we can leverage any of the existing 
mechanisms such as command line interface, SNMP 
or newer control mechanisms such as power-control 
over OpenFlow in order to support the power control
features. 


2.4 Prototype Testbed 


We build multiple testbeds to verify and evaluate 
ElasticTree, summarized in Table 3, with an example
shown in Figure 5. Each configuration multiplexes
many smaller virtual switches (with 4 or 6
ports) onto one or more large physical switches. All 
communication between virtual switches is done over 
direct links (not through any switch backplane or in- 
termediate switch). 

The smaller configuration is a complete k = 4 
three-layer homogeneous fat tree*, split into 20 in- 
dependent four-port virtual switches, supporting 16 
nodes at 1 Gbps apiece. One instantiation comprised
2 NEC IP8800 24-port switches and 1 48-port
switch, running OpenFlow v0.8.9 firmware provided
by NEC Labs. Another comprised two Quanta
LB4G 48-port switches, running the OpenFlow Ref- 
erence Broadcom firmware. 


*Refer to [1] for details on fat trees and the definition of k.
Figure 6: Measurement Setup


The larger configuration is a complete k = 6 
three-layer fat tree, split into 45 independent six- 
port virtual switches, supporting 54 hosts at 1 Gbps 
apiece. This configuration runs on one 288-port HP 
ProCurve 5412 chassis switch or two 144-port 5406 
chassis switches, running OpenFlow v0.8.9 firmware
provided by HP Labs. 
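The switch and host counts quoted for both configurations follow from the standard three-layer fat tree construction in [1]. The helper below is an illustrative sketch (the function name and returned keys are our own) that reproduces them for any even k.

```python
def fat_tree_size(k):
    """Element counts for a three-layer fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number")
    edge = k * k // 2          # k pods, k/2 edge switches per pod
    aggregation = k * k // 2   # k pods, k/2 aggregation switches per pod
    core = (k // 2) ** 2
    switches = edge + aggregation + core
    return {
        "hosts": k ** 3 // 4,  # k/2 hosts under each edge switch
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "switches": switches,
        "switch_ports": switches * k,
    }


for k in (4, 6):
    print(k, fat_tree_size(k))
```

For k = 4 this gives 20 switches and 16 hosts, and for k = 6 it gives 45 switches and 54 hosts, matching the two testbed configurations.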


2.5 Measurement Setup 


Evaluating ElasticTree requires infrastructure to 
generate a small data center’s worth of traffic, plus 
the ability to concurrently measure packet drops and 
delays. To this end, we have implemented a NetFPGA-based
traffic generator and a dedicated latency
monitor. The measurement architecture is shown in 
Figure 6. 

NetFPGA Traffic Generators. The NetFPGA
Packet Generator provides deterministic, line-rate
traffic generation for all packet sizes [28]. Each
NetFPGA emulates four servers with 1GE connections.
Multiple traffic generators combine to emulate
a larger group of independent servers: for the k=6
fat tree, 14 NetFPGAs represent 54 servers, and for
the k=4 fat tree, 4 NetFPGAs represent 16 servers.

At the start of each test, the traffic distribu- 
tion for each port is packed by a weighted round 
robin scheduler into the packet generator SRAM. All 
packet generators are synchronized by sending one 
packet through an Ethernet control port; these con- 
trol packets are sent consecutively to minimize the 
start-time variation. After sending traffic, we poll 
and store the transmit and receive counters on the 
packet generators. 

Latency Monitor. The latency monitor PC 
sends tracer packets along each packet path. Tracers
enter and exit through a different port on the same 
physical switch chip; there is one Ethernet port on 
the latency monitor PC per switch chip. Packets are 
logged by Pcap on entry and exit to record precise
timestamp deltas. We report median figures that are
averaged over all packet paths. To ensure measurements
are taken in steady state, the latency monitor
starts up after 100 ms. This technique captures
all but the last-hop egress queuing delays. Since 
edge links are never oversubscribed for our traffic 
patterns, the last-hop egress queue should incur no 


added delay. 
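One plausible reading of "median figures that are averaged over all packet paths" is to take the median timestamp delta per path and then average those medians. The sketch below follows that reading; the function name and the input layout are assumptions, not the monitor's actual code.

```python
from statistics import mean, median


def report_latency(deltas_by_path):
    """Average of per-path median latencies.

    deltas_by_path: path name -> list of (exit - entry) timestamp deltas
    recorded for tracer packets on that path, all in the same time unit.
    """
    per_path_median = [median(deltas) for deltas in deltas_by_path.values()]
    return mean(per_path_median)


samples = {
    "path_1": [1.0, 2.0, 9.0],  # one outlier; the median stays at 2.0
    "path_2": [4.0, 4.0, 4.0],
}
print(report_latency(samples))
```

Using the median within a path keeps a single delayed tracer from dominating the reported figure.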


3. POWER SAVINGS ANALYSIS 


In this section, we analyze ElasticTree’s network 
energy savings when compared to an always-on base- 
line. Our comparisons assume a homogeneous fat 
tree for simplicity, though the evaluation also applies 
to full-bisection-bandwidth topologies with aggrega- 
tion, such as those with 1G links at the edge and 
10G at the core. The primary metric we inspect is 
% original network power, computed as: 


\[ \%\ \text{original network power} = \frac{\text{Power consumed by ElasticTree}}{\text{Power consumed by original fat-tree}} \times 100 \]


This percentage gives an accurate idea of the over- 
all power saved by turning off switches and links 
(i.e., savings equal 100 - % original power). We 
use power numbers from switch model A (§1.3) for
both the baseline and ElasticTree cases, and only 
include active (powered-on) switches and links for 
ElasticTree cases. Since all three switches in Ta- 
ble 1 have an idle:active ratio of 3:1 (explained in 
§2.1), using power numbers from switch model B 
or C will yield similar network energy savings. Un- 
less otherwise noted, optimizer solutions come from 
the greedy bin-packing algorithm, with flow splitting 
disabled (as explained in Section 2). We validate the 
results for all k = {4,6} fat tree topologies on mul- 
tiple testbeds. For all communication patterns, the 
measured bandwidth as reported by receive counters 
matches the expected values. We only report energy 
saved directly from the network; extra energy will be 
required to power on and keep running the servers 
hosting ElasticTree modules. There will be additional
energy required for cooling these servers, and
at the same time, powering off unused switches will 
result in cooling energy savings. We do not include 
these extra costs/savings in this paper. 


3.1 Traffic Patterns 


Energy, performance and robustness all depend 
heavily on the traffic pattern. We now explore the 
possible energy savings over a wide range of communication
patterns, leaving performance and robustness
for §4.
Figure 7: Power savings as a function of demand,
with varying traffic locality, for a 28K-node,
k=48 fat tree


3.1.1 Uniform Demand, Varying Locality 


First, consider two extreme cases: near (highly 
localized) traffic matrices, where servers commu- 
nicate only with other servers through their edge 
switch, and far (non-localized) traffic matrices 
where servers communicate only with servers in 
other pods, through the network core. In this pat- 
tern, all traffic stays within the data center, and 
none comes from outside. Understanding these ex- 
treme cases helps to quantify the range of network 
energy savings. Here, we use the formal method as 
the optimizer in ElasticTree.

Near traffic is a best-case — leading to the largest 
energy savings — because ElasticTree will reduce 
the network to the minimum spanning tree, switch- 
ing off all but one core switch and one aggregation 
switch per pod. On the other hand, far traffic is a 
worst-case — leading to the smallest energy savings 
— because every link and switch in the network is 
needed. For far traffic, the savings depend heavily 
on the network utilization, $u = \frac{\sum_i \sum_j \lambda_{ij}}{\text{Total hosts}}$ ($\lambda_{ij}$ is the
traffic from host $i$ to host $j$, $\lambda_{ij} < 1$ Gbps). If $u$ is
close to 100%, then all links and switches must remain
active. However, with lower utilization, traffic
can be concentrated onto a smaller number of core 
links, and unused ones switch off. Figure 7 shows 
the potential savings as a function of utilization for 
both extremes, as well as traffic to the aggregation 
layer (Mid), for a k = 48 fat tree with roughly 28K
servers. Running ElasticTree on this configuration, 
with near traffic at low utilization, we expect a net- 
work energy reduction of 60%; we cannot save any 
further energy, as the active network subset in this 
case is the MST. For far traffic and u=100%, there 
are no energy savings. This graph highlights the
power benefit of local communications, but more im- 
Figure 8: Scatterplot of power savings with
random traffic matrix. Each point on the
graph corresponds to a pre-configured average
data center workload, for a k = 6 fat tree


portantly, shows potential savings in all cases. Hav- 
ing seen these two extremes, we now consider more 
realistic traffic matrices with a mix of both near and 
far traffic. 


3.1.2 Random Demand 


Here, we explore how much energy we can expect 
to save, on average, with random, admissible traf- 
fic matrices. Figure 8 shows energy saved by Elas- 
ticTree (relative to the baseline) for these matrices, 
generated by picking flows uniformly and randomly, 
then scaled down by the most oversubscribed host’s 
traffic to ensure admissibility. As seen previously, 
for low utilization, ElasticTree saves roughly 60% of 
the network power, regardless of the traffic matrix. 
As the utilization increases, traffic matrices with sig- 
nificant amounts of far traffic will have less room for 
power savings, and so the power saving decreases. 
The two large steps correspond to utilizations at 
which an extra aggregation switch becomes neces- 
sary across all pods. The smaller steps correspond 
to individual aggregation or core switches turning on 
and off. Some patterns will densely fill all available 
links, while others will have to incur the entire power 
cost of a switch for a single link; hence the variabil- 
ity in some regions of the graph. Utilizations above 
0.75 are not shown; for these matrices, the greedy 
bin-packer would sometimes fail to find a complete 
satisfying assignment of flows to links. 
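The matrix generation step can be sketched as below. The number of flows per host, the uniform rate distribution, and the choice to scale only when some host is oversubscribed are assumptions of ours; the text specifies only that flows are picked uniformly at random and scaled by the most oversubscribed host.

```python
import random


def random_admissible_matrix(num_hosts, flows_per_host, link_gbps=1.0, seed=0):
    """Random traffic matrix scaled so no host link is oversubscribed.

    Returns a dict mapping (src, dst) to a rate in Gbps.
    """
    rng = random.Random(seed)
    demand = {}
    for src in range(num_hosts):
        others = [h for h in range(num_hosts) if h != src]
        for dst in rng.sample(others, flows_per_host):
            demand[(src, dst)] = rng.uniform(0.0, link_gbps)

    # Find the host whose link carries the most traffic in either direction.
    sent = [0.0] * num_hosts
    received = [0.0] * num_hosts
    for (src, dst), rate in demand.items():
        sent[src] += rate
        received[dst] += rate
    worst = max(max(sent), max(received))

    # Scale every flow down by the same factor if that host is oversubscribed.
    scale = min(1.0, link_gbps / worst) if worst > 0 else 1.0
    return {pair: rate * scale for pair, rate in demand.items()}


matrix = random_admissible_matrix(num_hosts=54, flows_per_host=3)
print(len(matrix), round(sum(matrix.values()), 2))
```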


3.1.3 Sine-wave Demand 


As seen before (§1.2), the utilization of a data cen- 
ter will vary over time, on daily, seasonal and annual 
Figure 9: Power savings for sinusoidal traffic
variation in a k = 4 fat tree topology, with 1 
flow per host in the traffic matrix. The input 
demand has 10 discrete values. 


time scales. Figure 9 shows a time-varying utilization
and the power savings from ElasticTree, which follow the
utilization curve. To crudely approximate diurnal
variation, we assume u = 1/2(1+sin(t)), at time t,
suitably scaled to repeat once per day. For this sine 
wave pattern of traffic demand, the network power 
can be reduced up to 64% of the original power con- 
sumed, without being over-subscribed and causing 
congestion. 
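The demand curve can be sketched as follows, with t scaled so that one period spans 24 hours. Snapping the result to 10 evenly spaced levels mirrors the "10 discrete values" mentioned in the Figure 9 caption, but the exact quantization used in the experiment is not given, so that part is an assumption.

```python
import math


def sine_utilization(hour, levels=10):
    """Utilization u = 1/2 (1 + sin(t)), repeating once per 24 hours.

    The result is snapped to `levels` evenly spaced values in [0, 1].
    """
    t = 2.0 * math.pi * hour / 24.0
    u = 0.5 * (1.0 + math.sin(t))
    steps = levels - 1
    return round(u * steps) / steps


for hour in range(0, 24, 3):
    print(hour, round(sine_utilization(hour), 3))
```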

We note that most energy savings in all the above 
communication patterns comes from powering off 
switches. Current networking devices are far from 
being energy proportional, with even completely idle 
switches (0% utilization) consuming 70-80% of their 
fully loaded power (100% utilization) [22]; thus pow- 
ering off switches yields the most energy savings. 


3.1.4 Traffic in a Realistic Data Center 


In order to evaluate energy savings with a real 
data center workload, we collected system and net- 
work traces from a production data center hosting an 
e-commerce application (Trace 1, §1). The servers 
in the data center are organized in a tiered model as 
application servers, file servers and database servers. 
The System Activity Reporter (sar) toolkit available 
on Linux obtains CPU, memory and network statis- 
tics, including the number of bytes transmitted and 
received from 292 servers. Our traces contain statis- 
tics averaged over a 10-minute interval and span 5 
days in April 2008. The aggregate traffic through 
all the servers varies between 2 and 12 Gbps at any 
given time instant (Figure 2). Around 70% of the 
Figure 10: Energy savings for production
data center (e-commerce website) traces, over 
a 5 day period, using a k=12 fat tree. We 
show savings for different levels of overall 
traffic, with 70% destined outside the DC. 


traffic leaves the data center and the remaining 30% 
is distributed to servers within the data center. 

In order to compute the energy savings from ElasticTree
for these 292 hosts, we need a k = 12 fat
tree. Since our testbed only supports k = 4 and
k = 6 sized fat trees, we simulate the effect of ElasticTree
using the greedy bin-packing optimizer on
these traces. A fat tree with k = 12 can support up
to 432 servers; since our traces are from 292 servers,
we assume the remaining 140 servers have been powered
off. The edge switches associated with these
powered-off servers are assumed to be powered off;
we do not include their cost in the baseline routing
power calculation.
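
The sizing arithmetic above can be checked with a short sketch. It assumes the standard three-layer fat tree built from k-port switches (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k^3/4 hosts); the function names are illustrative and do not come from the ElasticTree code.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a three-layer fat tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,      # k/2 edge switches in each of k pods
        "agg_switches": k * half,       # k/2 aggregation switches per pod
        "core_switches": half * half,   # (k/2)^2 core switches
        "hosts": k * half * half,       # k^3/4 hosts
    }


def smallest_k_for(hosts: int) -> int:
    """Smallest even k whose fat tree can attach the given number of hosts."""
    k = 2
    while fat_tree_size(k)["hosts"] < hosts:
        k += 2
    return k


size = fat_tree_size(12)
unused_hosts = size["hosts"] - 292  # host slots assumed powered off in the text
```

For 292 traced servers, k = 10 gives only 250 host slots, so k = 12 is the smallest fat tree that fits, leaving 140 slots unused.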

The e-commerce service does not generate enough 
network traffic to require a high bisection bandwidth 
topology such as a fat tree. However, the time- 
varying characteristics are of interest for evaluating 
ElasticTree, and should remain valid with propor- 
tionally larger amounts of network traffic. Hence, 
we scale the traffic up by a factor of 20. 

For different scaling factors, as well as for different
intra data center versus outside data center (external)
traffic ratios, we observe energy savings ranging
from 25-62%. We present our energy savings results
in Figure 10. The main observation when visually
comparing with Figure 2 is that the power consumed
by the network follows the traffic load curve. Even
though individual network devices are not
energy-proportional, ElasticTree introduces energy
proportionality into the network.
Figure 11: Power cost of redundancy
Figure 12: Power consumption in a robust
data center network with safety margins, as
well as redundancy. Note “greedy+1” means
we add an MST over the solution returned by
the greedy solver.


We stress that network energy savings are work- 
load dependent. While we have explored savings 
in the best-case and worst-case traffic scenarios as 
well as using traces from a production data center, 
a highly utilized and “never-idle” data center net- 
work would not benefit from running ElasticTree. 


3.2 Robustness Analysis 


Typically data center networks incorporate some 
level of capacity margin, as well as redundancy in 
the topology, to prepare for traffic surges and net- 
work failures. In such cases, the network uses more 
switches and links than essential for the regular pro- 
duction workload. 

Consider the case where only a minimum spanning 
Figure 13: Queue Test Setups with one (left)
and two (right) bottlenecks


tree (MST) in the fat tree topology is turned on (all
other links/switches are powered off); this subset
certainly minimizes power consumption. However,
it also throws away all path redundancy, and with
it, all fault tolerance. In Figure 11, we extend the
MST in the fat tree with additional active switches,
for varying topology sizes. The MST+1 configuration
requires one additional edge switch per pod,
and one additional switch in the core, to enable any
single aggregation or core-level switch to fail without
disconnecting a server. The MST+2 configuration
enables any two failures in the core or aggregation
layers, with no loss of connectivity. As the
network size increases, the incremental cost of additional
fault tolerance becomes an insignificant part
of the total network power. For the largest networks,
the savings reduce by only 1% for each additional
spanning tree in the core and aggregation levels. Each
+1 increment in redundancy has an additive cost,
but a multiplicative benefit; with MST+2, for example,
the failures would have to happen in the same
pod to disconnect a host. This graph shows that the
added cost of fault tolerance is low.

Figure 12 presents power figures for the k=12 fat 
tree topology when we add safety margins for ac- 
commodating bursts in the workload. We observe 
that the additional power cost incurred is minimal, 
while improving the network’s ability to absorb un- 
expected traffic surges. 


4. PERFORMANCE 


The power savings shown in the previous section 
are worthwhile only if the performance penalty is 
negligible. In this section, we quantify the perfor- 
mance degradation from running traffic over a net- 
work subset, and show how to mitigate negative ef- 
fects with a safety margin. 


4.1 Queuing Baseline 


Figure 13 shows the setup for measuring the buffer 
depth in our test switches; when queuing occurs, 
this knowledge helps to estimate the number of hops 
where packets are delayed. In the congestion-free 
case (not shown), a dedicated latency monitor PC 
sends tracer packets into a switch, which sends it 
right back to the monitor. Packets are timestamped 


NSDI '10: 7th USENIX Symposium on Networked Systems Design and Implementation


Bottlenecks Median Std. Dev 
0 36.00 2.94 
1 473.97 tol2 
2 914.45 10.50 


Table 4: Latency baselines for Queue Test Se- 
tups 
Figure 14: Latency vs demand, with uniform
traffic.


by the kernel, and we record the latency of each re- 
ceived packet, as well as the number of drops. This 
test is useful mainly to quantify PC-induced latency 
variability. In the single-bottleneck case, two hosts 
send 0.7 Gbps of constant-rate traffic to a single 
switch output port, which connects through a second 
switch to a receiver. Concurrently with the packet 
generator traffic, the latency monitor sends tracer 
packets. In the double-bottleneck case, three hosts 
send 0.7 Gbps, again while tracers are sent. 

Table 4 shows the latency distribution of tracer
packets sent through the Quanta switch, for all three
cases. With no background traffic, the baseline latency
is 36 µs. In the single-bottleneck case, the
egress buffer fills immediately, and packets experience
474 µs of buffering delay. For the double-bottleneck
case, most packets are delayed twice, to
914 µs, while a smaller fraction take the single-bottleneck
path. The HP switch (data not shown)
follows the same pattern, with similar minimum latency
and about 1500 µs of buffer depth. All cases
show low measurement variation.


4.2 Uniform Traffic, Varying Demand 


In Figure 14, we see the latency totals for a uniform
traffic series where all traffic goes through the
core to a different pod, and every host sends one
flow. To allow the network to reach steady state,
measurements start 100 ms after packets are sent,
Figure 15: Drops vs overload with varying
safety margins


and continue until the end of the test, 900 ms later.
All tests use 512-byte packets; other packet sizes
yield the same results. The graph covers packet
generator traffic from idle to 1 Gbps, while tracer
packets are sent along every flow path. If our solution
is feasible, that is, all flows on each link sum to
less than its capacity, then we will see no dropped
packets, with a consistently low latency.

Instead, we observe sharp spikes at 0.25 Gbps, 
0.33 Gbps, and 0.5 Gbps. These spikes correspond 
to points where the available link bandwidth is ex- 
ceeded, even by a small amount. For example, when 
ElasticTree compresses four 0.25 Gbps flows along 
a single 1 Gbps link, Ethernet overheads (preamble, 
inter-frame spacing, and the CRC) cause the egress 
buffer to fill up. Packets either get dropped or sig- 
nificantly delayed. 
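
A rough sketch of why four 0.25 Gbps flows overflow a 1 Gbps link. It uses the standard Ethernet per-frame overheads (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap, 4 bytes of CRC) and assumes the stated demand counts only the 512 packet bytes. Whether those 512 bytes already include the CRC is not stated in the text, so the `crc_counted` flag is a hypothetical knob covering both readings.

```python
PREAMBLE_SFD_BYTES = 8     # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum idle time between frames, in byte times
FCS_BYTES = 4              # CRC (frame check sequence)


def wire_rate_gbps(flow_gbps, n_flows, packet_bytes, crc_counted=False):
    """Bandwidth consumed on the wire once per-frame overheads are added.

    Assumes flow_gbps counts only packet_bytes per packet (an assumption,
    not something the measurements above specify).
    """
    overhead = PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    if not crc_counted:
        overhead += FCS_BYTES
    return n_flows * flow_gbps * (packet_bytes + overhead) / packet_bytes


on_wire = wire_rate_gbps(0.25, 4, 512)  # exceeds the 1 Gbps link rate
```

Under either reading the on-wire rate is a few percent above 1 Gbps, which is enough to fill the egress buffer.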

This example motivates the need for a safety
margin to account for processing overheads, traffic
bursts, and sustained load increases. The issue is
not just that drops occur, but also that every packet
on an overloaded link experiences significant delay.
Next, we attempt to gain insight into how to set the
safety margin, or capacity reserve, such that performance
stays high up to a known traffic overload.


4.3 Setting Safety Margins 


Figures 15 and 16 show drops and latency as a 
function of traffic overload, for varying safety mar- 
gins. Safety margin is the amount of capacity re- 
served at every link by the optimizer; a higher safety 
margin provides performance insurance, by delaying 
the point at which drops start to occur, and aver- 
age latency starts to degrade. Traffic overload is 
the amount each host sends and receives beyond the 
original traffic matrix. The overload for a host is 
Figure 16: Latency vs overload with varying
safety margins
Figure 17: Computation time for different
optimizers as a function of network size


spread evenly across all flows sent by that host. For 
example, at zero overload, a solution with a safety 
margin of 100 Mbps will prevent more than 900 
Mbps of combined flows from crossing each link. If 
a host sends 4 flows (as in these plots) at 100 Mbps 
overload, each flow is boosted by 25 Mbps. Each 
data point represents the average over 5 traffic ma- 
trices. In all matrices, each host sends to 4 randomly 
chosen hosts, with a total outgoing bandwidth se- 
lected uniformly between 0 and 0.5 Gbps. All tests 
complete in one second. 
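
The two definitions above reduce to simple arithmetic, sketched here with illustrative function names (not taken from the ElasticTree code): the safety margin lowers the capacity the optimizer may fill on each link, and a host's overload is divided evenly among its flows.

```python
def usable_capacity_mbps(link_capacity_mbps, safety_margin_mbps):
    """Capacity the optimizer may fill on a link after reserving the margin."""
    if not 0 <= safety_margin_mbps <= link_capacity_mbps:
        raise ValueError("margin must lie between 0 and the link capacity")
    return link_capacity_mbps - safety_margin_mbps


def per_flow_boost_mbps(host_overload_mbps, flows_per_host):
    """Extra rate added to each flow when a host's overload is spread evenly."""
    if flows_per_host <= 0:
        raise ValueError("a host must send at least one flow")
    return host_overload_mbps / flows_per_host


budget = usable_capacity_mbps(1000, 100)  # 1 Gbps link, 100 Mbps margin
boost = per_flow_boost_mbps(100, 4)       # 100 Mbps overload over 4 flows
```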

Drops. Figure 15 shows no drops for small
overloads (up to 100 Mbps), followed by a steadily
increasing drop percentage as overload increases.
Loss percentage levels off somewhat after 500 Mbps,
as some flows cap out at 1 Gbps and generate no
extra traffic. As expected, increasing the safety
margin defers the point at which performance
degrades.




Latency. In Figure 16, latency shows a trend similar
to drops, except when overload increases to 200
Mbps, the performance effect is more pronounced.
For the 250 Mbps margin line, a 200 Mbps overload
results in 1% drops; however, latency increases
by 10x due to the few congested links. Some margin
lines cross at high overloads; this is not to say that a
smaller margin is outperforming a larger one, since
drops increase, and we ignore those in the latency
calculation.

Interpretation. Given these plots, a network operator
can choose the safety margin that best balances
the competing goals of performance and energy
efficiency. For example, a network operator
might observe from past history that the traffic average
never varies by more than 100 Mbps in any
10 minute span. She considers an average latency
under 100 µs to be acceptable. Assuming that ElasticTree
can transition to a new subset every 10 minutes,
the operator looks at 100 Mbps overload on
each plot. She then finds the smallest safety margin
with sufficient performance, which in this case is 150
Mbps. The operator can then have some assurance
that if the traffic changes as expected, the network
will meet her performance criteria, while consuming
the minimum amount of power.


5. PRACTICAL CONSIDERATIONS 


Here, we address some of the practical aspects of 
deploying ElasticTree in a live data center environ- 
ment. 


5.1 Comparing various optimizers 


We first discuss the scalability of various optimizers
in ElasticTree, based on solution time vs network
size, as shown in Figure 17. This analysis provides
a sense of the feasibility of their deployment in a
real data center. The formal model produces solutions
closest to optimal; however, for larger topologies
(such as fat trees with k >= 14), the time to
find the optimal solution becomes intractable. For
example, finding a network subset with the formal
model with flow splitting enabled on CPLEX on a
single-core, 2 GHz machine, for a k = 16 fat tree,
takes about an hour. The solution time growth
of this carefully optimized model is about $O(n^{3.5})$,
where n is the number of hosts. We then ran the
greedy bin-packer (written in unoptimized Python)
on a single core of a 2.13 GHz laptop with 3 GB of
RAM. The no-split version scaled as about $O(n^{2.5})$,
while the with-split version scaled slightly better,
as $O(n^{2})$. The topology-aware heuristic fares much
better, scaling as roughly $O(n)$, as expected. Subset
computation for 10K hosts takes less than 10
seconds for a single-core, unoptimized, Python
implementation, faster than the fastest switch boot
time we observed (30 seconds for the Quanta switch).
This result implies that the topology-aware heuristic
approach is not fundamentally unscalable, especially
considering that the number of operations increases
linearly with the number of hosts. We next
describe in detail the topology-aware heuristic, and
show how small modifications to its “starter subset”
can yield high-quality, practical network solutions,
in little time.


5.2 Topology-Aware Heuristic 


We describe precisely how to calculate the subset 
of active network elements using only port counters. 

Links. First, compute $LEdge^{up}_{p,e}$, the minimum
number of active links exiting edge switch e in pod
p to support up-traffic (edge → agg):

$$LEdge^{up}_{p,e} = \left\lceil \Big( \sum_{a \in A_p} F(e \to a) \Big) / r \right\rceil$$

$A_p$ is the set of aggregation switches in pod p,
$F(e \to a)$ is the traffic flow from edge switch e to
aggregation switch a, and r is the link rate. The
total up-traffic of e, divided by the link rate, equals
the minimum number of links from e required to
satisfy the up-traffic bandwidth. Similarly, compute
$LEdge^{down}_{p,e}$, the number of active links exiting edge
switch e in pod p to support down-traffic (agg →
edge):

$$LEdge^{down}_{p,e} = \left\lceil \Big( \sum_{a \in A_p} F(a \to e) \Big) / r \right\rceil$$

The maximum of these two values (plus 1, to ensure
a spanning tree at idle) gives $LEdge_{p,e}$, the minimum
number of links for edge switch e in pod p:

$$LEdge_{p,e} = \max\{ LEdge^{up}_{p,e},\ LEdge^{down}_{p,e},\ 1 \}$$

Now, compute the number of active links from
each pod to the core. $LAgg^{up}_{p}$ is the minimum number
of links from pod p to the core to satisfy the
up-traffic bandwidth (agg → core):

$$LAgg^{up}_{p} = \left\lceil \Big( \sum_{c \in C,\ a \in A_p,\ a \to c} F(a \to c) \Big) / r \right\rceil$$

Hence, we find the number of up-links, $LAgg^{down}_{p}$,
used to support down-traffic (core → agg) in pod p:

$$LAgg^{down}_{p} = \left\lceil \Big( \sum_{c \in C,\ a \in A_p,\ c \to a} F(c \to a) \Big) / r \right\rceil$$

The maximum of these two values (plus 1, to ensure
a spanning tree at idle) gives $LAgg_{p}$, the minimum
number of core links for pod p:

$$LAgg_{p} = \max\{ LAgg^{up}_{p},\ LAgg^{down}_{p},\ 1 \}$$


Switches. For both the aggregation and core layers,
the number of switches follows directly from the
link calculations, as every active link must connect
to an active switch. First, we compute $NAgg^{up}_{p}$, the
minimum number of aggregation switches required
to satisfy up-traffic (edge → agg) in pod p, taking the
maximum over the edge switches e in the pod:

$$NAgg^{up}_{p} = \max_{e} \{ LEdge^{up}_{p,e} \}$$


Next, compute $NAgg^{down}_{p}$, the minimum number
of aggregation switches required to support down-traffic
(core → agg) in pod p:

$$NAgg^{down}_{p} = \left\lceil LAgg^{down}_{p} / (k/2) \right\rceil$$


C is the set of core switches and k is the switch
degree. The number of core links in the pod, divided
by the number of uplinks in each aggregation
switch, equals the minimum number of aggregation
switches required to satisfy the bandwidth demands
from all core switches. The maximum of these two
values gives $NAgg_{p}$, the minimum number of active
aggregation switches in the pod:

$$NAgg_{p} = \max\{ NAgg^{up}_{p},\ NAgg^{down}_{p},\ 1 \}$$


Finally, the traffic between the core and the most-active
pod informs NCore, the number of core
switches that must be active to satisfy the traffic
demands:

$$NCore = \left\lceil \max_{p} \big( LAgg^{up}_{p} \big) \right\rceil$$


Robustness. The equations above assume that 
100% utilized links are acceptable. We can change 
r, the link rate parameter, to set the desired aver- 
age link utilization. Reducing r reserves additional 
resources to absorb traffic overloads, plus helps to 
reduce queuing delay. Further, if hashing is used to 
balance flows across different links, reducing r helps 
account for collisions. 

To add k-redundancy to the starter subset for improved
fault tolerance, add k aggregation switches
to each pod and the core, plus activate the links
on all added switches. Adding k-redundancy can be
thought of as adding k parallel MSTs that overlap
at the edge switches. These two approaches can be
combined for better robustness.
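
The link and switch counts of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-switch traffic totals are assumed to be already summed from port counters, the input layout (`edge_up[p][e]`, `core_up[p]`, and so on) is hypothetical, and NCore is computed without a floor of 1. A safety margin can be modelled by passing a reduced link rate r.

```python
from math import ceil


def starter_subset(edge_up, edge_down, core_up, core_down, k, r):
    """Starter-subset link and switch counts from aggregated traffic totals.

    edge_up[p][e]   : total up-traffic leaving edge switch e in pod p
    edge_down[p][e] : total down-traffic arriving at edge switch e in pod p
    core_up[p]      : total traffic from pod p to the core
    core_down[p]    : total traffic from the core to pod p
    k               : switch degree; r : link rate (same unit as the traffic)
    """
    subset = {"LEdge": {}, "LAgg": {}, "NAgg": {}, "NCore": 0}
    for p in edge_up:
        ledge_up = {e: ceil(edge_up[p][e] / r) for e in edge_up[p]}
        ledge_down = {e: ceil(edge_down[p][e] / r) for e in edge_up[p]}
        # At least one link per edge switch keeps a spanning tree at idle.
        subset["LEdge"][p] = {
            e: max(ledge_up[e], ledge_down[e], 1) for e in ledge_up
        }
        lagg_up = ceil(core_up[p] / r)
        lagg_down = ceil(core_down[p] / r)
        subset["LAgg"][p] = max(lagg_up, lagg_down, 1)
        # Aggregation switches: enough for the busiest edge switch's uplinks
        # and for the pod's core-facing down-traffic (k/2 uplinks per switch).
        nagg_up = max(ledge_up.values())
        nagg_down = ceil(lagg_down / (k / 2))
        subset["NAgg"][p] = max(nagg_up, nagg_down, 1)
        # Core switches are driven by the most active pod's up-traffic.
        subset["NCore"] = max(subset["NCore"], lagg_up)
    return subset


# Hypothetical k = 4 example: traffic in Mbps, 1 Gbps links, pod 1 idle.
example = starter_subset(
    edge_up={0: {"e0": 300, "e1": 1200}, 1: {"e0": 0, "e1": 0}},
    edge_down={0: {"e0": 200, "e1": 500}, 1: {"e0": 0, "e1": 0}},
    core_up={0: 1400, 1: 0},
    core_down={0: 600, 1: 0},
    k=4,
    r=1000,
)
```

In this made-up example the busy pod needs two uplinks on edge switch e1, two aggregation switches and two core switches, while the idle pod falls back to a single spanning-tree path.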


5.3 Response Time


The ability of ElasticTree to respond to spikes in
traffic depends on the time required to gather statistics,
compute a solution, wait for switches to boot,
enable links, and push down new routes. We measured
the time required to power on/off links and
switches in real hardware and found that the dominant
time is waiting for the switch to boot up, which
ranges from 30 seconds for the Quanta switch to
about 3 minutes for the HP switch. Powering individual
ports on and off takes about 1-3 seconds.
Populating the entire flow table on a switch takes under
5 seconds, while reading all port counters takes
less than 100 ms for both. Switch models in the future
may support features such as going into various
sleep modes; waking up from a sleep
mode will be significantly faster than booting up.
ElasticTree can then choose which switches to power
off versus which ones to put to sleep.

Further, the ability to predict traffic patterns for 
the next few hours for traces that exhibit regular 
behavior will allow network operators to plan ahead 
and get the required capacity (plus some safety mar- 
gin) ready in time for the next traffic spike. Al- 
ternately, a control loop strategy to address perfor- 
mance effects from burstiness would be to dynami- 
cally increase the safety margin whenever a thresh- 
old set by a service-level agreement policy were ex- 
ceeded, such as a percentage of packet drops. 
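
The control-loop idea can be illustrated with a minimal sketch. The step size, the upper bound, and the use of a drop percentage as the trigger are all hypothetical parameters chosen for illustration; the text does not specify a particular policy.

```python
def next_safety_margin(current_mbps, drop_pct, drop_threshold_pct,
                       step_mbps=50, max_margin_mbps=500):
    """Raise the safety margin by one step when drops exceed the SLA threshold.

    step_mbps and max_margin_mbps are illustrative defaults, not values
    taken from the measurements in this paper.
    """
    if drop_pct > drop_threshold_pct:
        return min(current_mbps + step_mbps, max_margin_mbps)
    return current_mbps


margin = next_safety_margin(100, drop_pct=1.5, drop_threshold_pct=1.0)
```

A real controller would also need a rule for lowering the margin again once drops subside, which this sketch leaves out.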


5.4 Traffic Prediction 


In all of our experiments, we input the entire traffic
matrix to the optimizer, and thus assume that
we have complete prior knowledge of incoming traffic.
In a real deployment of ElasticTree, such an
assumption is unrealistic. One possible workaround
is to predict the incoming traffic matrix based on
historical traffic, in order to plan ahead for expected
traffic spikes or long-term changes. While prediction
techniques are highly sensitive to workloads,
they are more effective for traffic that exhibits regular
patterns, such as our production data center traces
(§3.1.4). We experiment with a simple autoregressive
AR(1) prediction model in order to predict traffic
to and from each of the 292 servers. We use traffic
traces from the first day to train the model, then
use this model to predict traffic for the entire 5 day
period. Using the traffic prediction, the greedy bin-packer
can determine an active topology subset as
well as flow routes.
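
An AR(1) model predicts each sample from the previous one, x[t] = c + phi * x[t-1]. The sketch below fits c and phi by ordinary least squares on a training series and makes a one-step prediction. The fitting method and function names are assumptions for illustration; the text does not say how the model was trained.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    if n < 2:
        raise ValueError("need at least three samples")
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var_prev = sum((a - mean_prev) ** 2 for a in x_prev)
    if var_prev == 0:
        raise ValueError("series is constant; phi is undefined")
    cov = sum((a - mean_prev) * (b - mean_next)
              for a, b in zip(x_prev, x_next))
    phi = cov / var_prev
    c = mean_next - phi * mean_prev
    return c, phi


def predict_next(c, phi, last_value):
    """One-step-ahead prediction from the most recent sample."""
    return c + phi * last_value


# Synthetic, noise-free training series generated by x[t] = 2 + 0.5 * x[t-1].
training = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(training)
forecast = predict_next(c, phi, training[-1])
```

On real traces the fit would be run per server on the first day's 10-minute samples, and the residual noise would make the recovered coefficients approximate rather than exact.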

While detailed traffic prediction and analysis are
beyond the scope of this paper, our initial experimental
results are encouraging. They imply that
even simple prediction models can be used for data
center traffic that exhibits periodic (and thus predictable)
behavior.


5.5 Fault Tolerance 


ElasticTree modules can be placed in ways that
mitigate fault tolerance worries. In our testbed, the
routing and optimizer modules run on a single host
PC. This arrangement ties the fate of the whole system
to that of each module; an optimizer crash is
capable of bringing down the system.

Fortunately, the topology-aware heuristic, the
optimizer most likely to be deployed, operates independently
of routing. The simple solution is to move
the optimizer to a separate host to prevent slow
computation or crashes from affecting routing. Our
OpenFlow switches support a passive listening port,
to which the read-only optimizer can connect to grab
port statistics. After computing the switch/link subset,
the optimizer must send this subset to the routing
controller, which can apply it to the network.
If the optimizer doesn’t check in within a fixed period
of time, the controller should bring all switches
up. The reliability of ElasticTree should be no worse
than the optimizer-less original; the failure condition
brings back the original network power, plus a time
period with reduced network capacity.
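
The fail-safe rule described above (bring every switch up if the optimizer stops checking in) can be sketched as a small watchdog on the controller side. The class and method names are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting; this is not the testbed's controller code.

```python
import time


class OptimizerWatchdog:
    """Tracks optimizer check-ins and falls back to the full topology."""

    def __init__(self, timeout_s, all_switches, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.all_switches = set(all_switches)
        self.clock = clock
        self.last_checkin = clock()
        self.active = set(all_switches)  # start with everything powered on

    def checkin(self, subset):
        """Called when the optimizer delivers a new active-switch subset."""
        self.active = set(subset) & self.all_switches
        self.last_checkin = self.clock()

    def poll(self):
        """Called periodically by the controller; returns switches to keep on."""
        if self.clock() - self.last_checkin > self.timeout_s:
            # Optimizer is late: restore the original, fully powered network.
            self.active = set(self.all_switches)
        return self.active


# Usage with a fake clock so the timeout can be simulated.
now = [0.0]
watchdog = OptimizerWatchdog(30, {"s1", "s2", "s3"}, clock=lambda: now[0])
watchdog.checkin({"s1"})
now[0] = 10.0
before_timeout = set(watchdog.poll())
now[0] = 45.0
after_timeout = set(watchdog.poll())
```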

For optimizers tied to routing, such as the formal
model and greedy bin-packer, known techniques
can provide controller-level fault tolerance. In active
standby, the primary controller performs all required
tasks, while the redundant controllers stay idle. On
failing to receive a periodic heartbeat from the primary,
a redundant controller becomes the new primary.
This technique has been demonstrated with
NOX, so we expect it to work with our system. In
the more complicated full replication case, multiple
controllers are simultaneously active, and state (for
routing and optimization) is held consistent between
them. For ElasticTree, the optimization calculations
would be spread among the controllers, and each
controller would be responsible for power control for
a section of the network. For a more detailed discussion
of these issues, see §3.5 “Replicating the Controller:
Fault-Tolerance and Scalability” in [5].


6. RELATED WORK 


This paper tries to extend the idea of power proportionality
into the network domain, as first described
by Barroso et al. [4]. Gupta et al. [17] were
amongst the earliest researchers to advocate conserving
energy in networks. They suggested putting
network components to sleep in order to save energy
and explored the feasibility in a LAN setting
in a later paper [18]. Several others have proposed
techniques such as putting idle components in a
switch (or router) to sleep [18] as well as adapting
the link rate [14], including the IEEE 802.3az Task
Force [19].

Chabarek et al. [6] use mixed integer programming
to optimize router power in a wide area network, by 




choosing the chassis and linecard configuration to 
best meet the expected demand. In contrast, our 
formulation optimizes a data center local area net- 
work, finds the power-optimal network subset and 
routing to use, and includes an evaluation of our 
prototype. Further, we detail the tradeoffs associ- 
ated with our approach, including impact on packet 
latency and drops. 

Nedevschi et al. [26] propose shaping the traffic
into small bursts at edge routers to facilitate putting
routers to sleep. Their research is complementary to
ours. Further, their work addresses edge routers in
the Internet while our algorithms are for data centers.
In a recent work, Ananthanarayanan et al. [3]
motivate via simulation two schemes, a lower
power mode for ports and time window prediction
techniques, that vendors can implement in future
switches. While these and other improvements can
be made in future switch designs to make them more
energy efficient, most energy (70-80% of their total
power) is consumed by switches in their idle state.
A more effective way of saving power is using a traffic
routing approach such as ours to maximize idle
switches and power them off. Another recent paper [25]
discusses the benefits and deployment
models of a network proxy that would allow end-hosts
to sleep while the proxy keeps the network
connection alive.

Other complementary research in data center net- 
works has focused on scalability [24][10], switching 
layers that can incorporate different policies [20], or 
architectures with programmable switches [11]. 


7. DISCUSSION 


The idea of disabling critical network infrastruc- 
ture in data centers has been considered taboo. Any 
dynamic energy management system that attempts 
to achieve energy proportionality by powering off a 
subset of idle components must demonstrate that 
the active components can still meet the current of- 
fered load, as well as changing load in the immedi- 
ate future. The power savings must be worthwhile, 
performance effects must be minimal, and fault tol- 
erance must not be sacrificed. The system must pro- 
duce a feasible set of network subsets that can route 
to all hosts, and be able to scale to a data center 
with tens of thousands of servers. 

To this end, we have built ElasticTree, which,
through data-center-wide traffic management and
control, introduces energy proportionality in today’s
non-energy proportional networks. Our initial results
(covering analysis, simulation, and hardware
prototypes) demonstrate the tradeoffs between performance,
robustness, and energy; the safety margin
parameter provides network administrators with
control over these tradeoffs. ElasticTree’s ability to
respond to sudden increases in traffic is currently
limited by the switch boot delay, but this limitation
can be addressed, relatively simply, by adding
a sleep mode to switches.

ElasticTree opens up many questions. For exam- 
ple, how will TCP-based application traffic interact 
with ElasticTree? TCP maintains link utilization in 
sawtooth mode; a network with primarily TCP flows 
might yield measured traffic that stays below the 
threshold for a small safety margin, causing Elas- 
ticTree to never increase capacity. Another ques- 
tion is the effect of increasing network size: a larger 
network probably means more, smaller flows, which 
pack more densely, and reduce the chance of queuing 
delays and drops. We would also like to explore the 
general applicability of the heuristic to other topolo- 
gies, such as hypercubes and butterflies. 

Unlike choosing between cost, speed, and relia- 
bility when purchasing a car, with ElasticTree one 
doesn’t have to pick just two when offered perfor- 
mance, robustness, and energy efficiency. During 
periods of low to mid utilization, and for a variety 
of communication patterns (as is often observed in 
data centers), ElasticTree can maintain the robust- 
ness and performance, while lowering the energy bill. 
APPENDIX
A. POWER OPTIMIZATION PROBLEM 


Our model is a multi-commodity flow formulation, 
augmented with binary variables for the power state 
of links and switches. It minimizes the total network 
power by solving a mixed-integer linear program. 


A.1 Multi-Commodity Network Flow


Flow network $G(V, E)$ has edges $(u, v) \in E$
with capacity $c(u, v)$. There are k commodities
$K_1, K_2, \ldots, K_k$, defined by $K_i = (s_i, t_i, d_i)$, where,
for commodity i, $s_i$ is the source, $t_i$ is the sink, and
$d_i$ is the demand. The flow of commodity i along
edge $(u, v)$ is $f_i(u, v)$. Find a flow assignment which
satisfies the following three constraints [8]:

Capacity constraints: The total flow along each
link must not exceed the edge capacity.

$$\forall (u, v) \in E,\quad \sum_{i=1}^{k} f_i(u, v) \le c(u, v)$$

Flow conservation: Commodities are neither
created nor destroyed at intermediate nodes.

$$\forall i,\quad \sum_{w \in V} f_i(u, w) = 0, \text{ when } u \ne s_i \text{ and } u \ne t_i$$

Demand satisfaction: Each source and sink sends
or receives an amount equal to its demand.

$$\forall i,\quad \sum_{w \in V} f_i(s_i, w) = \sum_{w \in V} f_i(w, t_i) = d_i$$
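
The three constraints can be checked mechanically for a candidate flow assignment. The sketch below is an illustrative checker, not part of the formal model: it assumes nonnegative per-edge flows and tests conservation as net flow (outflow minus inflow) at each node, which is a common equivalent of the textbook form; the data layout and function name are hypothetical.

```python
from collections import defaultdict


def check_flow(capacity, commodities, flows, tol=1e-9):
    """Return a list of violated constraints (empty if the flow is feasible).

    capacity    : {(u, v): c(u, v)}
    commodities : list of (s_i, t_i, d_i)
    flows       : list of {(u, v): f_i(u, v)}, one dict per commodity, f_i >= 0
    """
    problems = []

    # Capacity: summed flow on each edge must not exceed its capacity.
    total = defaultdict(float)
    for f in flows:
        for edge, amount in f.items():
            total[edge] += amount
    for edge, amount in total.items():
        if amount > capacity.get(edge, 0.0) + tol:
            problems.append(f"capacity exceeded on {edge}")

    for i, ((s, t, d), f) in enumerate(zip(commodities, flows)):
        net = defaultdict(float)  # outflow minus inflow per node
        for (u, v), amount in f.items():
            net[u] += amount
            net[v] -= amount
        # Conservation: intermediate nodes neither create nor destroy flow.
        for node, value in list(net.items()):
            if node not in (s, t) and abs(value) > tol:
                problems.append(f"commodity {i}: conservation broken at {node}")
        # Demand: the source emits d and the sink absorbs d.
        if abs(net[s] - d) > tol or abs(net[t] + d) > tol:
            problems.append(f"commodity {i}: demand not satisfied")
    return problems


# Tiny made-up example: a two-hop path a -> b -> c with unit capacities.
cap = {("a", "b"): 1.0, ("b", "c"): 1.0}
ok = check_flow(cap, [("a", "c", 0.5)],
                [{("a", "b"): 0.5, ("b", "c"): 0.5}])
overloaded = check_flow(cap, [("a", "c", 0.6)] * 2,
                        [{("a", "b"): 0.6, ("b", "c"): 0.6}] * 2)
broken = check_flow(cap, [("a", "c", 0.5)], [{("a", "b"): 0.5}])
```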


A.2 Power Minimization Constraints 


Our formulation uses the following notation: 


$S$            Set of all switches
$V_u$          Set of nodes connected to a switch u
$a(u, v)$      Power cost for link (u, v)
$b(u)$         Power cost for switch u
$X_{u,v}$      Binary decision variable indicating
               whether link (u, v) is powered ON
$Y_u$          Binary decision variable indicating
               whether switch u is powered ON
$E_i$          Set of all unique edges used by flow i
$r_i(u, v)$    Binary decision variable indicating
               whether commodity i uses link (u, v)


The objective function, which minimizes the total 
network power consumption, can be represented as: 


$$\text{Minimize } \sum_{(u,v) \in E} X_{u,v} \times a(u, v) + \sum_{u \in V} Y_u \times b(u)$$


The following additional constraints create a de- 
pendency between the flow routing and power states: 
Deactivated links have no traffic: Flow is restricted
to only those links (and consequently
the switches) that are powered on. Thus, for all
links (u, v) used by commodity i, $f_i(u, v) = 0$
when $X_{u,v} = 0$. Since the flow variable f is
positive in our formulation, the linearized constraint is:

$$\forall i,\ \forall (u, v) \in E,\quad \sum_{i=1}^{k} f_i(u, v) \le c(u, v) \times X_{u,v}$$

The optimization objective inherently enforces
the converse, which states that links with no
traffic can be turned off.


Link power is bidirectional: Both “halves” of
an Ethernet link must be powered on if traffic
is flowing in either direction:

$$\forall (u, v) \in E,\quad X_{u,v} = X_{v,u}$$


Correlate link and switch decision variable:
When a switch u is powered off, all links
connected to this switch are also powered off:

$$\forall u \in V,\ \forall w \in V_u,\quad X_{u,w} = X_{w,u} \le Y_u$$


Similarly, when all links connecting to a switch
are off, the switch can be powered off. The linearized
constraint is:

$$\forall u \in V,\quad Y_u \le \sum_{w \in V_u} X_{w,u}$$

A.3 Flow Split Constraints 


Splitting flows is typically undesirable due to TCP 
packet reordering effects [21]. We can prevent flow 
splitting in the above formulation by adopting the 
following constraint, which ensures that the traffic 
on link (u,v) of commodity 7 is equal to either the 
full demand or zero: 


$$\forall i,\ \forall (u, v) \in E,\quad f_i(u, v) = d_i \times r_i(u, v)$$


The regularity of the fat tree, combined with restricted
tree routing, helps to reduce the number of
flow split binary variables. For example, each inter-pod
flow must go from the aggregation layer to the
core, with exactly $(k/2)^2$ path choices. Rather than
consider binary variable r for all edges along every
possible path, we only consider the set of “unique
edges”, those at the highest layer traversed. In the
inter-pod case, this is the set of aggregation to edge
links. We precompute the set of unique edges $E_i$
usable by commodity i, instead of using all edges in
E. Note that the flow conservation equations will
ensure that a connected set of unique edges is traversed
for each flow.
Abstract


Operators of data centers want a scalable network fab- 
ric that supports high bisection bandwidth and host mo- 
bility, but which costs very little to purchase and admin- 
ister. Ethernet almost solves the problem — it is cheap and 
supports high link bandwidths — but traditional Ethernet 
does not scale, because its spanning-tree topology forces 
traffic onto a single tree. Many researchers have de- 
scribed “scalable Ethernet” designs to solve the scaling 
problem, by enabling the use of multiple paths through 
the network. However, most such designs require spe- 
cific wiring topologies, which can create deployment 
problems, or changes to the network switches, which 
could obviate the commodity pricing of these parts. 

In this paper, we describe SPAIN (“Smart Path Assignment
In Networks”). SPAIN provides multipath forwarding
using inexpensive, commodity off-the-shelf (COTS)
Ethernet switches, over arbitrary topologies. SPAIN pre- 
computes a set of paths that exploit the redundancy in a 
given network topology, then merges these paths into a 
set of trees; each tree is mapped as a separate VLAN 
onto the physical Ethernet. SPAIN requires only mi- 
nor end-host software modifications, including a sim- 
ple algorithm that chooses between pre-installed paths 
to efficiently spread load over the network. We demon- 
strate SPAIN’s ability to improve bisection bandwidth 
over both simulated and experimental data-center net- 
works. 


1 Introduction 


Data-center operators often take advantage of scale, 
both to amortize fixed costs, such as facilities and staff, 
over many servers, and to allow high-bandwidth, low- 
latency communications among arbitrarily large sets of 
machines. They thus desire scalable data-center net- 
works. Data-center operators also must reduce costs for 
both equipment and operations; commodity off-the-shelf 
(COTS) components often provide the best total cost of 
ownership (TCO). 

Ethernet is becoming the primary network technology 
for data centers, especially as protocols such as Fibre 
Channel over Ethernet (FCoE) begin to allow convergence
of all data-center networking onto a single fabric.
COTS Ethernet has many nice features, especially ubiq- 
uity, self-configuration, and high link bandwidth at low 
cost, but traditional Layer-2 (L2) Ethernet cannot scale to 
large data centers. Adding IP (Layer-3) routers “solves” 
the scaling problem via the use of subnets, but introduces 
new problems, especially the difficulty of supporting dy- 
namic mobility of virtual machines: the lack of a single 
flat address space makes it much harder to move a VM 
between subnets. Also, the use of IP routers instead of 
Ethernet switches can increase hardware costs and com- 
plicate network management. 

This is not a new problem; plenty of recent research 
papers have proposed scalable data-center network de- 
signs based on Ethernet hardware. All such propos- 
als address the core scalability problem with traditional 
Ethernet, which is that to support self-configuration of 
switches, it forces all traffic into a single spanning 
tree [28] — even if the physical wired topology provides 
multiple paths that could, in principle, avoid unnecessary 
sharing of links between flows. 

In Sec. 3, we discuss previous proposals in specific 
detail. Here, at the risk of overgeneralizing, we assert 
that SPAIN improves over previous work by providing 
multipath forwarding using inexpensive, COTS Ethernet 
switches, over arbitrary topologies, and supporting incre- 
mental deployment; we are not aware of previous work 
that does all four. 

Support for COTS switches probably reduces costs, 
and certainly reduces the time before SPAIN could be 
deployed, compared to designs that require even small 
changes to switches. Support for arbitrary topologies is 
especially important because it allows SPAIN to be used 
without re-designing the entire physical network, in con- 
trast to designs that require hypercubes, fat-trees, etc., 
and because there may be no single topology that best 
meets all needs. Together, both properties allow incre- 
mental deployment of SPAIN in an existing data-center 
network, without reducing its benefits in a purpose-built 
network. SPAIN can also function without a real-time 
central controller, although it may be useful to exploit
such a controller to guarantee specific QoS properties. 

In SPAIN, an offline network controller system first 
pre-computes a set of paths that exploit the redundancy 
in a given network topology, with the goal of utilizing 
the redundancy in the physical wiring both to provide 
high bisection bandwidth (low over-subscription), and to 
support several failover paths between any given pair of 
hosts. The controller then merges these paths into a set 
of trees and maps each tree onto a separate VLAN, ex- 
ploiting the VLAN support in COTS Ethernet switches. 
In most cases, only a small number of VLANs suffice to 
cover the physical network. 

SPAIN does require modifications to end-host sys- 
tems, including a simple algorithm that chooses between 
pre-installed paths to efficiently spread load over the net- 
work. These modifications are quite limited; in Linux, 
they consist of a loadable kernel module and a user-level 
controller. 

We have evaluated SPAIN both in simulation and in 
an experimental deployment on a network testbed. We 
show that SPAIN adds virtually no end-host overheads, 
that it significantly improves bisection bandwidth on a 
variety of topologies, that it can be deployed incremen- 
tally with immediate performance benefits, and that it 
tolerates faults in the network. 


2 Background and goals 


Ethernet is known for its ease-of-use. Hosts come 
with preset addresses and simply need to be plugged in; 
each network switch automatically learns the locations 
of other switches and of end hosts. Switches organize 
themselves into a spanning tree to form loop-free paths 
between all source-destination pairs. Hence, not surpris- 
ingly, Ethernet now forms the basis of virtually all en- 
terprise and data center networks. This popularity made 
many Ethernet switches an inexpensive commodity and 
led to continuous improvements. 10Gbps Ethernet is fast 
becoming commoditized [18], the 40Gbps standard is ex- 
pected this year [21], and the standardization of 100Gbps 
is already underway [6]. 

Network operators would like to be able to scale Eth- 
ernet to an entire data center, but it is very difficult to 
do so, as we detail in section 2.2. Hence, today most 
such networks are designed as several modest-sized Eth- 
ernets (IP subnets), connected by one or two layers of IP 
routers [2, 3, 10]. 


2.1 Why we want Ethernet to scale


The use of multiple IP subnets, especially within a data
center, creates significant management complexity. In
particular, it requires a network administrator to
simultaneously and consistently manage the L2 and L3 layers,
even though these are based on very different forwarding,
control, and administrative mechanisms.

Consider a typical network composed of Ethernet- 
based IP subnets. This not only requires the configura- 
tion of IP subnets and routing protocols—which is con- 
sidered a hard problem in itself [22]—but also sacrifices 
the simplicity of Ethernet’s plug-and-play operation. For 
instance, as explained in [2, 3], in such a hybrid network, 
to allow the end hosts to efficiently reach the IP-routing
layer, all Ethernet switches must be configured such that
their automatic forwarding table computation is forced to
pick only the shortest paths between the IP-routing layer
and the hosts.

Dividing a data center into a set of IP subnets has
other drawbacks. It requires configuring DHCP servers
for each subnet and designing an IP address assignment
that does not severely fragment the IP address space
(especially with IPv4), and it makes topology changes
hard to handle [22]. For example, migrating a
live virtual machine from one side of the data center 
to another, to deal with a cooling imbalance, requires 
changing that VM’s IP address — this can disrupt exist- 
ing connections. 

For these reasons, it becomes very attractive to scale a 
single Ethernet to connect an entire data center or enter- 
prise. 


2.2 Why Ethernet is hard to scale 


Ethernet’s lack of scalability stems from three main 
problems: 

1. Its use of the Spanning Tree Protocol to automatically
ensure a loop-free topology.

2. Packet floods for learning host locations.

3. Host-generated broadcasts, especially for ARP.

We discuss each of these issues. 

Spanning tree: Spanning Tree Protocol (STP) [28] 
was a critical part of the initial success of Ethernet; it 
allows automatic self-configuration of a set of relatively 
simple switches. Using STP, all the switches in an L2 
domain agree on a subset of links between them, so as 
to form a spanning tree over all switches. By forwarding 
packets only on those links, the switches ensure connec- 
tivity while eliminating packet-forwarding loops. Other- 
wise, Ethernet would have had to carry a hop-count or 
TTL field, which would have created compatibility and 
implementation challenges. 

STP, however, creates significant problems for scal- 
able data-center networks: 


• Limited bisection bandwidth: Since there is (by
definition) only one path through the spanning tree
between any pair of hosts, a source-destination pair
cannot use multiple paths to achieve the best possible
bandwidth. Also, since links on any path are
probably shared by many other host pairs, congestion
can arise, especially near the designated
(high-bandwidth) root switch of the tree. The aggregate
throughput of the network can be much lower than
the sum of the NIC throughputs.

• High-cost core switches: Aggregate throughput
can be improved by use of a high-fanout, high- 
bandwidth switch at the root of the tree. Scaling 
root-switch bandwidth can be prohibitively expen- 
sive [10], especially since this switch must be repli- 
cated to avoid a single point of failure for the entire 
data center. Also, the STP must be properly config- 
ured to ensure that the spanning tree actually gets 
rooted at this expensive switch. 

• Low reliability: Since the spanning tree leads to
lots of sharing at links closer to the root, a failure 
can affect an unnecessarily large fraction of paths. 

• Reduced flexibility in node placement: Generally,
for a given source-destination pair, the higher the 
common ancestor in the spanning tree, the higher 
the number of competing source-destination pairs 
that share links in the subtree, and thus the lower 
the throughput that this given pair can achieve. 
Hence, to ensure adequate throughput, frequently- 
communicating source-destination pairs must be 
connected to the same switch, or to neighboring 
switches with the lowest possible common ancestor. 
Such restrictions, particularly in case of massive- 
scale applications that require high server-to-server 
bandwidth, inhibit flexibility in workload placement 
or cause substantial performance penalties [10, 18]. 


SPAIN avoids these problems by employing multiple 
spanning trees, which can fully exploit the path redun- 
dancy in the physical topology, especially if the wiring 
topology is not a simple tree. 

Packet floods: Ethernet’s automatic self-configuration
is often a virtue: a host can be plugged
into a port anywhere in the network, and the switches 
discover its location by observing the packets it trans- 
mits [32]. A switch learns the location of a MAC address 
by recording, in its learning table, the switch port on 
which it first sees a packet sent from that address. To 
support host mobility, switches periodically forget these 
bindings and re-learn them. 

If a host has not sent packets in the recent past, there- 
fore, switches will not know its location. When forward- 
ing a packet whose destination address is not in its learn- 
ing table, a switch must “flood” the packet on all of its 
ports in the spanning tree (except on the port the packet 
arrived on). This flooding traffic can be a serious limit to 
scalability [1, 22]. 
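
The learning-and-flooding behavior described above can be illustrated with a toy model. This is a minimal sketch, not taken from any switch implementation: the class name `LearningSwitch`, the `max_age` parameter, and the explicit `now` timestamps are assumptions made for illustration only.

```python
class LearningSwitch:
    """Toy model of Ethernet MAC learning with aging (illustrative names)."""

    def __init__(self, ports, max_age=300.0):
        self.ports = set(ports)
        self.max_age = max_age      # seconds before a binding is forgotten
        self.table = {}             # MAC address -> (port, time learned)

    def receive(self, src, dst, in_port, now):
        """Process one frame; return the set of ports it is sent out on."""
        # Learn the port where src is first seen; re-learn a stale binding.
        entry = self.table.get(src)
        if entry is None or now - entry[1] > self.max_age:
            self.table[src] = (in_port, now)
        # Forward on the learned port if the destination binding is fresh.
        entry = self.table.get(dst)
        if entry is not None and now - entry[1] <= self.max_age:
            return {entry[0]}
        # Unknown or expired destination: flood on all ports except ingress.
        return self.ports - {in_port}
```

In this model a host that stays silent for longer than `max_age` causes the next frame addressed to it to be flooded, which is the timeout-related flooding the text refers to.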

In SPAIN, we use a mechanism called chirping (see 
Sec. 6) which avoids most timeout-related flooding. 

Host broadcasts: Ethernet’s original shared-bus design
made broadcasting easy; not surprisingly, protocols
such as the Address Resolution Protocol (ARP) and the
Dynamic Host Configuration Protocol (DHCP) were de- 
signed to exploit broadcasts. Since broadcasts consume 
resources throughout a layer-2 domain, broadcasting can 
limit the scalability of an Ethernet domain [3, 15, 22, 26]. 
Greenberg et al. [16] observe that “...the overhead of 
broadcast traffic (e.g., ARP) limits the size of an IP sub- 
net to a few hundred servers...” 

SPAIN does not eliminate broadcasts, but we can ex- 
ploit certain aspects of both the data-center environment 
and our willingness to modify end-host implementations. 
See [25] for more discussion. 


3 Related Work 


Spanning trees in Ethernet have a long history. The 
original algorithm was first proposed in 1985 [28], and
was adopted as the IEEE 802.1D standard in 1990. Since
then it has been improved and adapted along several
dimensions. While the Rapid Spanning Tree Protocol
(802.1w) reduces convergence time, the Per-VLAN Spanning
Tree (802.1Q) improves link utilization by allowing
each VLAN to have its own spanning tree. Sharma et al. 
exploit these multiple spanning trees to achieve improved 
fault recovery [33]. In their work, the Viking manager, a
central entity, communicates with and proactively manages both
switches and end hosts. Based on its global view of the
network, the manager selects (and, if needed, dynami- 
cally re-configures) the spanning trees. 

Most proposals for improving Ethernet scalability fo- 
cus on eliminating the restrictions to a single spanning 
tree. 

SmartBridge, proposed by Rodeheffer et al. [30], com- 
pletely eliminated the use of a spanning tree. Smart- 
Bridges learn, based on the principles of diffused compu- 
tation, locations of switches as well as hosts to forward 
packets along the shortest paths. STAR, a subsequent 
architecture by Lui et al. [24] achieves similar benefits 
while also facilitating interoperability with the 802.1D 
standard. Perlman’s RBridges, based on an IS-IS rout- 
ing protocol, allow shortest paths, can inter-operate with 
existing bridges, and can also be optimized for IP [29]. 
Currently, this work is being standardized by the TRILL 
working group of IETF [7]. Note that TRILL focuses 
only on shortest-path or equal-cost routes, and does not 
support multiple paths of different lengths. 

Myers et al. [26] proposed eliminating the basic rea- 
son for the spanning tree, the reliance on broadcast as 
the basic primitive, by combining link-state routing with 
a directory for host information. 

More recently, Kim et al. [22] proposed the SEATTLE
architecture for very large and dynamic Ethernets.
Their switches combine link-state routing with a DHT
to achieve broadcast elimination and shortest-path
forwarding, without suffering the large space requirements
of some of the prior approaches; otherwise, SEATTLE is
quite similar to TRILL.

Table 1: Comparing SPAIN against related work

                      | SPAIN               | SEATTLE [22] | TRILL [7]   | VL2 [17]    | PortLand [27] | MOOSE [31]
Wiring topology       | Arbitrary           | Arbitrary    | Arbitrary   | Fat-tree    | Fat-tree      | Arbitrary
Usable paths          | Arb. multiple paths | Single path  | ECMP        | ECMP        | [illegible]   | Single path
Deploy incrementally? | YES                 | [illegible]  | [illegible] | [illegible] | [illegible]   | [illegible]
Uses COTS switches?   | YES (L2)            | NO           | NO          | YES         | [illegible]   | [illegible]
Needs end-host mods?  | YES                 | NO           | NO          | YES         | NO            | NO

Fat-tree = multi-rooted tree; ECMP = Equal Cost Multi-Path.

Greenberg et al. [18] proposed an architecture that 
scales to 100,000 or more servers. They exploit pro- 
grammable commodity layer-2 switches, allowing them 
to modify the data and control planes to support hot-spot- 
free multipath routing. A sender host for each flow con- 
sults a central directory and determines a random inter- 
mediary switch; it then bounces the flow via this inter- 
mediary. When all switches know of efficient paths to 
all other switches, going via a random intermediary is 
expected to achieve good load spreading. 

Several researchers have proposed specific regular 
topologies that support scalability. Fat trees, in par- 
ticular, have received significant attention. Al-Fares et 
al. [10] advocate combining fat trees with a specific IP 
addressing assignment, thereby supporting novel switch 
algorithms that provide high bisection bandwidth with- 
out expensive core switches. Mysore et al. [27] update 
this approach in their PortLand design, which uses MAC- 
address re-writing instead of IP addressing, thus creating 
a flat L2 network. Scott et al. [31] similarly use MAC- 
address re-writing in MOOSE, but without imposing a 
specific topology; however, MOOSE uses shortest-path 
forwarding, rather than multipath. 

VL2 [17] provides the illusion of a large L2 net- 
work on top of an IP network with a Clos [14] topol- 
ogy, using a logically centralized directory service. VL2 
achieves Equal-Cost Multipath (ECMP) forwarding in 
Clos topologies by assigning a single IP anycast address 
to all core switches. It is not obvious how one could 
assign such IP anycast addresses to make multipath for- 
warding work in non-Clos topologies. 

The commercial switch vendor Woven Systems [8] 
also used a fat tree for the interconnect inside their switch 
chassis, combining their proprietary vScale chips with
Ethernet switch chips that include specific support for
fat-trees [4]. The vScale chips use a proprietary algorithm to
spread load across the fat-tree paths. 

In contrast to fat-tree topologies, others have proposed 
recursive topologies such as hypercubes. These include 
DCell [20] and BCube [19]. 

As summarized in Tab. 1, SPAIN differs from all of 
this prior work because it provides multipath forwarding,
uses unmodified COTS switches, works with arbitrary
topologies, supports incremental deployment, and
requires no centralized controllers. 


4 The design of SPAIN 


We start with our specific goals for SPAIN, including 
the context in which it operates. Our goals are to: 


• Deliver more bandwidth and better reliability than
spanning tree.

• Support arbitrary topologies, not just fat-tree or
hypercube, and extract the best bisection bandwidth
from any topology.

• Utilize unmodified, off-the-shelf, commodity-priced
(COTS) Ethernet switches.

• Minimize end host software changes, and be
incrementally deployable.


In particular, we want to support flat Layer-2 addressing 
and routing, so as to: 

• Simplify network manageability by retaining the
plug-and-play properties of Ethernet at larger 
scales. 

• Facilitate non-routable protocols, such as Fibre
Channel over Ethernet (FCoE), that are required for 
“fabric convergence” within data centers [23]. Fab- 
ric convergence, the replacement of special-purpose 
interconnects such as Fibre Channel with standard 
Ethernet, can reduce hardware costs, management 
costs, and rack space. 

• Improve the flexibility of virtual server and storage
placement within data centers, by reducing the
chances that arbitrary placement could create band- 
width problems, and by avoiding the complexity of 
VM migration between IP subnets. 


We explicitly limit the focus of SPAIN to data-center net- 
works, rather than trying to solve the general problem 
of how to scale Ethernet. Also, while we believe that 
SPAIN will scale to relatively large networks, our goal is 
not to scale to arbitrary sizes, but to support typical-sized 
data-centers. 


Figure 1: Example of VLANs used for multipathing (legend: VLAN 1, VLAN 2)


4.1 Overview of SPAIN 


In SPAIN, we pre-compute a set of paths that utilizes 
the redundancy in the physical wiring, both to provide 
high bisection bandwidth and to improve fault tolerance. 
We then merge these paths into a set of trees, map each 
tree to a separate VLAN, and install these VLANs on the 
switches. We usually need only a few VLANs to cover 
the physical network, since a single VLAN ID can be 
re-used for multiple disjoint subtrees. 

SPAIN allows a pair of end hosts to use different 
VLANs, potentially traversing different links at differ- 
ent times, for different flows; hence, SPAIN can achieve 
higher throughput and better fault-tolerance than tradi- 
tional spanning-tree Ethernet. 

SPAIN reserves VLAN 1 to include all nodes. This
default VLAN is thus always available as a fallback path, 
or if we need to broadcast or multicast to all nodes. We 
believe that we can support multicast more efficiently by 
mapping multicast trees onto special VLANs, but this is 
future work. 

SPAIN requires only a few switch features: MAC- 
address learning and VLAN support; these are already 
present in most COTS switches. Optionally, SPAIN can 
exploit other switch features to improve performance, 
scale, and fault tolerance, or to reduce manual configu- 
ration: LLDP; SNMP queries to get LLDP information; 
and the Per-VLAN Spanning Tree Protocol or the Multi- 
ple Spanning Tree Protocol (see Sec. 5.5). 

SPAIN requires a switch to store multiple table entries 
(one per VLAN tree) for each destination, in the worst 
case where flows are active for all possible (VLAN, des- 
tination) pairs. (Table overflows lead to packet flooding; 
they are not fatal.) This could limit SPAIN’s applicability 
to very large networks with densely populated traffic ma- 
trices, but even inexpensive merchant-silicon switches 
have sufficiently large tables for moderately-large net- 
works. 

For data centers where MAC addresses are known a 
priori, we have designed another approach called FIB- 
pinning, but do not describe it here due to space con- 
straints. See [25] for more details. 

Fig. 1 illustrates SPAIN with a toy example, which 
could be a fragment of a larger data-center network. 
Although there is a link between switches S1 and S2, 
the standard STP does not forward traffic via that link.
SPAIN creates two VLANs, with VLAN1 covering the
normal spanning tree, and VLAN2 covering the alternate
link. Once the VLANs have been configured, end-host
A could (for example) use VLAN1 for flows to C while
end-host B uses VLAN2 for flows to D, thereby dou- 
bling the available bandwidth versus traditional Ether- 
net. (SPAIN allows more complex end-host algorithms, 
to support fault tolerance and load balancing.) 

Note that TRILL or SEATTLE, both of which are
shortest-path (or equal-cost multi-path) protocols, would only use
the path corresponding to VLAN2. 

SPAIN requires answers to three questions: 

1. Given an arbitrary topology of links and switches, 
with finite switch resources, how should we com- 
pute the possible paths to use between host pairs? 

2. How can we set up the switches to utilize these 
paths? 

3. How do pairs of end hosts choose which of several 
possible paths to use? 

Thus, SPAIN includes three key components, for path 
computation, path setup, and path selection. The first 
two can run offline (although online reconfiguration 
could help improve network-wide QoS and failure re- 
silience); the path selection process runs online at the 
end hosts for each flow. 


5 Offline configuration of the network 


In this section, we describe the centralized algorithms 
SPAIN uses for offline network configuration: path com- 
putation and path setup. (Sec. 6 discusses the online, 
end-host-based path selection algorithms.)

These algorithms address several challenges: 


• Which set of paths to use?: The goal is to compute
the smallest set of paths that exploit all of the
redundancy in the network.

• How to map paths to VLANs?: We must minimize
the number of VLANs used, since Ethernet
only allows 4096 VLANs, and some switches support
fewer. Also, each VLAN consumes switch resources:
a switch needs to cache a learning-table
entry for each known MAC on each VLAN.

• How to handle unplanned topology changes?:
Physical topologies (links and switches) change either
due to failures and repairs of links and switches,
or due to planned upgrades. Our approach is
to recompute and re-install paths only during upgrades,
which should be infrequent, and depend on
dynamic fault-tolerance techniques to handle
unplanned changes.

Because of space constraints, we omit many details of 
the path computation algorithms; these may be found in 
the Technical Report version of the paper [25]. 


5.1 Practical issues


SPAIN’s centralized configuration mechanism must
address two practical issues: learning the actual topology,
and configuring the individual switches with the correct
VLANs.

Switches use the Link-Layer Discovery Protocol 
(LLDP) (IEEE Standard 802.1AB) to advertise their 
identities and capabilities. They collect the information 
they receive from their neighbors and store it in their 
SNMP MIB. We can leverage this support to program- 
matically determine the topology of the entire L2 net- 
work. 
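
As a rough illustration of the topology-learning step, the sketch below assembles a switch-level graph from per-switch neighbor records. It is only a sketch under stated assumptions: the record layout `(local_port, remote_switch, remote_port)` and the function names are hypothetical, no SNMP or LLDP library is used, and at most one link between any pair of switches is assumed.

```python
def build_topology(neighbor_tables):
    """Build {switch: {neighbor: local_port}} from LLDP-style neighbor records.

    neighbor_tables: {switch: [(local_port, remote_switch, remote_port), ...]}
    (a hypothetical layout standing in for data read from each switch's MIB).
    """
    adjacency = {}
    for switch, records in neighbor_tables.items():
        for local_port, remote_switch, _remote_port in records:
            adjacency.setdefault(switch, {})[remote_switch] = local_port
            adjacency.setdefault(remote_switch, {})
    return adjacency


def links(adjacency):
    """Return the undirected links as a set of sorted (switch, switch) pairs."""
    return {tuple(sorted((a, b))) for a, nbrs in adjacency.items() for b in nbrs}
```

The resulting adjacency map records, for every switch, which local port faces each neighbor; that is the information needed later to place ports into the right VLANs.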

Switches maintain a VLAN-map table, to track the
VLANs allowed on each physical interface, along with
information about whether packets will arrive with a
VLAN header or not. Each interface can be set in untagged
mode or tagged mode for each VLAN.¹ If a port
is in tagged mode for a VLAN v, packets received on that
interface with VLAN tag v in the Ethernet header are
accepted for forwarding. If a port is in untagged mode
for VLAN v, all packets received on that port without
a VLAN tag are assumed to be part of VLAN v. Any
packet with VLAN v received on a port not configured
for VLAN v is simply dropped. For SPAIN, we assume
that this VLAN assignment can be performed
programmatically using SNMP.
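
The tagged/untagged rules in the preceding paragraph can be written as a small ingress check. This is a minimal sketch; the dictionary layout of `port_config` and the function name are assumptions, and the handling of a frame tagged with the port's own untagged VLAN (dropped here) is a simplification that real switches may treat differently.

```python
def classify_ingress(port_config, frame_vlan):
    """Return the VLAN a received frame is assigned to, or None if dropped.

    port_config: {"tagged": set of VLAN ids, "untagged": VLAN id or None}
    frame_vlan:  VLAN id in the frame's header, or None for an untagged frame.
    """
    if frame_vlan is None:
        # Untagged frames join the port's untagged VLAN, if one is configured.
        return port_config.get("untagged")
    if frame_vlan in port_config.get("tagged", set()):
        # Tagged frames are accepted only on ports in tagged mode for that VLAN.
        return frame_vlan
    # Tagged for a VLAN this port is not configured to carry: drop.
    return None
```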

For each graph computed by the path layout program,
SPAIN’s switch configuration module instantiates
a VLAN corresponding to that graph onto the switches
covered by that VLAN. For a graph G(V, E) with VLAN
number v, this module contacts the switch corresponding
to each vertex in V and sets all ports of that switch
whose corresponding edges appear in E in tagged mode
for VLAN v. Also, all ports facing end-hosts are set to
tagged mode for VLAN v, so that tagged packets from
end-hosts are accepted.
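
The configuration step just described can be sketched as a function that lists which ports to place in tagged mode for one VLAN. The names `tagged_port_commands`, `port_of`, and `host_ports` are hypothetical, and the output is a plain list of tuples rather than actual SNMP operations, which are outside the scope of this sketch.

```python
def tagged_port_commands(vlan_id, edges, port_of, host_ports):
    """List (switch, port, vlan_id) settings that instantiate one VLAN graph.

    edges:      iterable of (switch_a, switch_b) links in the VLAN's graph
    port_of:    {(switch, neighbor): port on `switch` that faces `neighbor`}
    host_ports: {switch: iterable of ports facing end hosts}
    """
    ports = {}
    for a, b in edges:
        # Both ends of every edge in the graph carry the VLAN in tagged mode.
        ports.setdefault(a, set()).add(port_of[(a, b)])
        ports.setdefault(b, set()).add(port_of[(b, a)])
    for switch in ports:
        # Host-facing ports are tagged too, so tagged host frames are accepted.
        ports[switch].update(host_ports.get(switch, ()))
    return sorted((sw, p, vlan_id) for sw, ps in ports.items() for p in ps)
```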


5.2 Path-set computation 


Our first goal is to compute a path set: a set of link- 
by-link loop-free paths connecting pairs of end hosts 
through the topology. 

A good path set achieves two simultaneous objectives. 
First, it exploits the available topological redundancy. 
That is, the path set includes enough paths to ensure 
that any source-destination pair, at any given time, can 
find at least one usable path between them. By “usable
path”, we mean a path that does not go through bottlenecked
or failed links. Hence a path set that includes
all possible paths is trivially the best, in terms of ex- 
ploiting the redundancy. However, such a path set might 
be impractical, because switch resources (especially on 
COTS switches) might be insufficient to instantiate so
many paths. Thus, the second objective for a good path
set is that it has a limited number of paths.

¹ This is the terminology used by HP ProCurve. Cisco uses the terms
access mode and trunk mode.

Algorithm 1 Algorithm for Path Computation
 1: Given:
 2:   G_full = (V_full, E_full): The full topology,
 3:   w: Edge weights,
 4:   s: Source, d: Destination
 5:   k: Desired number of paths per s, d pair
 6:
 7: Initialize: ∀e ∈ E: w(e) = 1
 8: /* shortest computes weighted shortest path */
 9: Path p = shortest(G, s, d, w);
10: for e ∈ p do
11:   w(e) += |E|
12:
13: while (|P| < k) do
14:   p = shortest(G, s, d, w)
15:   if p ∈ P then
16:     /* no more useful paths */
17:     break;
18:   P = P ∪ {p}
19:   for e ∈ p do
20:     w(e) += |E|
21:
22: return P

We accomplish this in steps shown in Algorithm 1. 
(This algorithm has been simplified to assume unit 
edge capacities; the extension to non-uniform weights is 
straightforward.) 

First (lines 7-11), we initialize the set of paths for
each source-destination pair to include the shortest path. 
Shortest paths are attractive because in general, they min- 
imize the network resources needed for each packet, and 
have a higher probability of staying usable after failures. 
That is, under the simplifying assumption that each link 
independently fails (either bottlenecks or goes down) 
with a constant probability f, then a path p of length |p|
will be usable with probability $P_u(p) = (1-f)^{|p|}$.
(We informally refer to this probability, that a path will
be usable, as its “usability,” and similarly for the
probability of a set of paths between a source-destination pair.)
Clearly, since the shortest path has the smallest length, it
will have the highest $P_u$.

Then (lines 13-20), we grow the path set to meet the 
desired degree (k) of path diversity between any pair of 
hosts. Note that a path set is usable if at least one of the 
paths is usable. We denote the usability of a path set ps
as $PS_u(ps)$. This probability depends not only on the
lengths of the paths in the set, but also on the degree of
shared links between the paths. A best path set of size k
has the maximum $PS_u(\cdot)$ of all possible path sets of size
k. However, it is computationally infeasible to find the
best path set of size k. Hence, we use a greedy algorithm
that adds one path at a time, and that prefers the path that
has the minimum number of links in common with paths
that are already in the path set. 
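
To make the usability notion concrete, the sketch below evaluates it numerically under the same simplifying assumption of independent link failures with probability f. A path is usable only if every one of its links is usable, and a set of link-disjoint paths is usable if at least one path is. The function names are illustrative, and the second function deliberately covers only the link-disjoint case; paths that share links are not independent and need a more careful calculation.

```python
def path_usability(length, f):
    """Probability that all `length` links of a path are usable."""
    return (1.0 - f) ** length


def disjoint_set_usability(lengths, f):
    """Usability of a set of link-disjoint paths with the given lengths.

    The set is unusable only if every path in it is unusable; disjointness
    makes those events independent, so their probabilities multiply.
    """
    all_unusable = 1.0
    for n in lengths:
        all_unusable *= 1.0 - path_usability(n, f)
    return 1.0 - all_unusable
```

For example, with f = 0.1 a 3-hop path has usability 0.729, while pairing it with a disjoint 2-hop path raises the usability of the set to about 0.949.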

We prefer adding a link-disjoint path, because a single
link failure cannot simultaneously take down both
the new path and the existing paths. As shown in [25], 
in most networks with realistic topologies and operating 
conditions, a link-disjoint path improves the usability of 
a path set by the largest amount. 

As shown in lines 10-11 and 19-20, we implement 
our preference for link-disjoint paths by incrementing the 
edge weights of the path we have added to the path set by 
a large number (number of edges). This ensures that the 
subsequent shortest-path computation picks a link that is 
already part of the path set only if it absolutely has to. 
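
The weight-penalty idea of Algorithm 1 can be sketched in runnable form as follows. This is an illustration under assumptions, not the authors' code: the graph is an adjacency dictionary with unit initial weights, `shortest` is a plain Dijkstra search, and every path (including the first) is added to the returned set, following the prose description.

```python
import heapq


def shortest(adj, s, d, w):
    """Dijkstra: minimum-weight s-d path as a tuple of nodes, or None."""
    dist, prev, done = {s: 0}, {}, set()
    heap = [(0, s)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == d:
            break
        for v in adj[u]:
            new_cost = cost + w[frozenset((u, v))]
            if v not in dist or new_cost < dist[v]:
                dist[v], prev[v] = new_cost, u
                heapq.heappush(heap, (new_cost, v))
    if d not in dist:
        return None
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return tuple(reversed(path))


def compute_paths(adj, s, d, k):
    """Greedily pick up to k s-d paths, penalizing links already used."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    w = {e: 1 for e in edges}
    penalty = len(edges)            # |E|: larger than any loop-free detour
    paths = []
    while len(paths) < k:
        p = shortest(adj, s, d, w)
        if p is None or p in paths:
            break                   # no more useful paths
        paths.append(p)
        for u, v in zip(p, p[1:]):
            w[frozenset((u, v))] += penalty
    return paths
```

Because the penalty equals the number of edges, reusing one penalized link always costs more than any detour over unpenalized links, so a shared link is chosen only when no alternative exists.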


5.3 Mapping path sets to VLANs

Given a set of paths with the desired diversity, SPAIN 
must then map them onto a minimal set of VLANs. (Re- 
member that Ethernet switches support 4096 VLANs, 
sometimes fewer.) 

We need to ensure that the subgraphs formed by the 
paths of each VLAN are loop-free, so that the switches 
work correctly in the face of forwarding-table lookup 
misses. On such a lookup miss for a packet on a VLAN 
v, a switch will flood the packet to all outgoing interfaces
of VLAN v; if the VLAN has a loop, the packet
will circulate forever. (We could run the spanning-tree 
protocol on each VLAN to ensure there are no loops, 
but then there would be no point in adding links to the 
SPAIN VLANs that the STP would simply remove from 
service.)


Problem 1. VLAN Minimization: Given a set of paths
$P = \{p_1, p_2, \ldots, p_n\}$ in a graph $G = (V, E)$, find an
assignment of paths to VLANs, with minimal number of
VLANs, such that the subgraph formed by the paths of
each VLAN is loop-free.


We prove in [25] that Problem 1 is NP-hard. Therefore,
we employ a greedy VLAN-packing heuristic,
Algorithm 2. Given the set of all paths P computed in
Algorithm 1, Algorithm 2 processes the paths serially,
constructing a set of subgraphs SG that include those paths.
For each path p, if p is not covered by any subgraph in 
the current set SG, the algorithm tries to greedily pack 
that path p into any one of the subgraphs in the current 
set (lines 6-12). If the greedy packing step fails for a 
path, a new graph is created with this path, and is added 
to SG (lines 13-15). 

Running this algorithm just once might not yield a
solution near the optimum. Therefore, we use the best
solution from N runs, randomizing the order in which paths
are chosen for packing, and the order in which the
current set of subgraphs SG are examined.
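
The greedy packing and its randomized restarts can be sketched as below. This is a simplified illustration, not the authors' implementation: it takes all paths as one precomputed list rather than computing them per host pair, stops at the first subgraph that accepts a path, and detects loops with a union-find pass. All function names are hypothetical.

```python
import random


def path_edges(path):
    """Undirected edge set of a path given as a tuple of nodes."""
    return {frozenset(e) for e in zip(path, path[1:])}


def creates_loop(subgraph_edges, new_edges):
    """True if adding new_edges to a loop-free subgraph would close a cycle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    extra = [e for e in new_edges if e not in subgraph_edges]
    for edge in list(subgraph_edges) + extra:
        u, v = tuple(edge)
        ru, rv = find(u), find(v)
        if ru == rv:
            return True             # u and v already connected: cycle
        parent[ru] = rv
    return False


def pack_paths(paths, rng):
    """One greedy pass: pack paths into loop-free subgraphs (edge sets)."""
    subgraphs = []
    order = list(paths)
    rng.shuffle(order)
    for p in order:
        edges = path_edges(p)
        if any(edges <= sg for sg in subgraphs):
            continue                # already covered by an existing subgraph
        candidates = list(range(len(subgraphs)))
        rng.shuffle(candidates)
        for i in candidates:
            if not creates_loop(subgraphs[i], edges):
                subgraphs[i] |= edges
                break
        else:
            subgraphs.append(set(edges))
    return subgraphs


def best_packing(paths, runs, seed=0):
    """Keep the packing with the fewest subgraphs over `runs` random passes."""
    rng = random.Random(seed)
    return min((pack_paths(paths, rng) for _ in range(runs)), key=len)
```

Each returned subgraph corresponds to one VLAN; fewer subgraphs means fewer VLAN IDs and fewer learning-table entries per switch.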

The serial nature of Algorithm 2 does not scale well;
its complexity is $O(mkn^2)$, where m is the number of VLANs, k
is the number of paths, and n is the number of switches.
We have designed a parallel algorithm, described in [25],
based on graph-coloring heuristics, which yields speedup
linear in the number of edge switches.


Algorithm 2 Greedy VLAN Packing Heuristic
 1: Given: G = (V, E), k
 2: SG = ∅ /* set of loop-free subgraphs */
 3: for v ∈ V do
 4:   for u ∈ V do
 5:     P = ComputePaths(G, v, u, k);
 6:     for p ∈ P /* in a random order */ do
 7:       if p not covered by any graph in SG then
 8:         Success = FALSE;
 9:         for S ∈ SG /* in a random order */ do
10:           if p does not create loop in S then
11:             Add p to S
12:             Success = TRUE;
13:         if Success == FALSE then
14:           S' = new graph with p
15:           SG = SG ∪ {S'}
16: return SG


Figure 2: 7-switch topology; original tree in bold


5.3.1 An example 


Fig. 2 shows a relatively simple wiring topology with 
seven switches. One can think of this as a 1-level tree 
(with switch #1 as the root), augmented by adding three 
cross-connect links to each non-root switch. 

Fig. 3 shows how the heuristic greedy algorithm 
(Alg. 2) chooses seven VLANs to cover this topology. 
VLAN #1 is the original tree (and is used as the default 
spanning tree). 


5.4 Algorithm performance 


Since our algorithm is an approximate solution to an 
NP-hard problem, we applied it to a variety of differ- 
ent topologies that have been suggested for data-center 
networks, to see how many VLANs it requires. Where 
possible, we also present the optimal number of VLANs. 
(See [25] for more details about this analysis.) 

These topologies include FatTree (p) [10], a 2-ary 3-tree,
where p is the number of ports per switch; BCube
(p, l) [19], where p is the number of ports per switch, and
l is the number of levels in the recursive construction of
Figure 3: VLANs covering 7-switch topology of Fig. 2


the topology; 2-D HyperX (k) [9], where k is the number
of switches in each dimension of a 2-D mesh; and CiscoDC,
Cisco’s recommended data center network [12],
a three-layer tree with two core switches, and with
parameters (m, a) where m is the number of aggregation
modules, and a the number of access switch pairs
associated with each aggregation module.


Table 2: Performance of VLAN mapping heuristic

Topology          Optimal #VLANs     Heuristic #VLANs   Trials (N) for best
FatTree (p)       …                  …                  … for all p
BCube (p, l)      …                  …                  290 for (2,3)
                                                        6 for (3,2)
2-D HyperX (k)    Unknown, O(k^3)    12 for k=3         …
                                     38 for k=4
CiscoDC (m, a)    Unknown            9 for (2,2)        …
                                     12 for (3,2)
                                     18 for (4,3)
Open Cirrus       4                  4                  …
subset





Table 2 shows the performance of SPAIN’s VLAN 
mapping heuristic on different topologies. The heuris- 
tic matches the optimal mapping on FatTree and BCube. 
We don’t yet know the optimal value for CiscoDC or 2-D 
HyperX, although for 2-D HyperX, k^3 is a loose upper
bound. The table also shows that, for the Open Cirrus 
subset used in our experiments (Sec. 10), the heuristic 
uses the optimal number (4) of VLANs. 

The last column in the table shows the number of trials
(N) it took for SPAIN’s VLAN packing algorithm to
generate its best result; we show the worst case over five
runs, and the averages are much smaller. In some cases,
luck seems to play a role in how many trials are required. 
Each row took less than 60 sec., using a single CPU (for 
these computations, we used the serial algorithm, not the 
parallel algorithm). 


5.5 Fault tolerance in SPAIN 


A SPAIN-based network must disable the normal STP 
behavior on all switches; otherwise, they will block the 
use of their non-spanning-tree ports, preventing SPAIN 
from using those links in its VLANs. (SPAIN configures 
its VLANs to avoid loops, of course.) Disabling STP 
means that we lose its automatic fault tolerance. 

Instead, SPAIN’s fault tolerance is based on the pre- 
provisioning of multiple paths between pairs of hosts, 
and on end-host detection and recovery from link and 
switch failures; see Sec. 6.6 for details. 

However, SPAIN could use features like Cisco’s pro- 
prietary Per-VLAN Spanning Tree (PVST) or the IEEE 
802.1s standard Multiple Spanning Tree (MST) to im- 
prove fault tolerance. SPAIN could configure switches 
so that, for each VLAN, PVST or MST would prefer the 
ports in that VLAN over other ports (using per-port span- 
ning tree priorities or weights). This allows a switch to 
fail over to the secondary ports if PVST or MST detects a 
failure. SPAIN would still use its end-host failure mech- 
anisms for rapid repair of flows as the spanning tree pro- 
tocols have higher convergence time, and in case some 
switches do not support PVST/MST. 


6 End-host algorithms 


Once the off-line algorithms have computed the 
paths and configured the switches with the appropriate 
VLANs, all of the online intelligence in SPAIN lies in the 
end hosts.2 SPAIN’s end-host algorithms are designed
to meet five goals: (1) effectively spread load across
the pre-computed paths, (2) minimize the overheads of
broadcasting and flooding, (3) efficiently detect and react
to failures in the network, (4) facilitate end-point mobility
(e.g., VM migration), and (5) enable incremental
deployment. We generically refer to the end-host imple- 
mentation as the “SPAIN driver,” although (as described 
in Sec. 9), some functions run in a user-mode daemon 
rather than in the kernel-mode driver. 

The SPAIN driver has four major functions: boot-time 
initialization, sending a packet, receiving a packet, and 
re-initializing a host after it moves to a new edge switch. 

An end host uses the following data structures and pa- 
rameters: 

• ES(m): the ID of the edge switch to which MAC address
  m is currently connected.


2 SPAIN could support the use of a centralized service to help
end-hosts optimize their load balancing, but we have not yet
implemented this service, nor is it a necessary feature.
• V_reach(es): the set of VLANs that reach the edge switch
  es.
• R: the reachability VLAN map, a bit map encoding the
  union of V_reach(es) over all es, computed by the
  algorithms in Section 5.
• V_usable(es): the set of VLANs that have recently tested
  as usable to reach es.
• T_repin is the length of time after which non-TCP flows
  go through the VLAN re-pinning process.
• T_sent is the minimum amount of time since last send on
  a VLAN that triggers a chirp (see below).
• V_sent(es): the set of VLANs on which we have sent a
  packet via es within the last T_sent seconds.


SPAIN uses a protocol we call chirping for several functions:
to avoid most timeout-related flooding, to test
VLANs for usability, and to support virtual-machine migration.
For VM migration, chirping works analogously
to the Gratuitous ARP (GARP) mechanism, in which a
host broadcasts an ARP request for its own IP → MAC
binding.

An end-host A sends a unicast chirp packet to another
host B if B has just started sending a flow to A, and if
A has not sent any packets (including chirps) in the recent
past (T_sent) to any host connected to the same edge
switch as B. An end-host (possibly a virtual machine)
broadcasts a chirp packet when it reboots and when it
moves to a different switch. A chirp packet carries the
triple <IP Address, MAC address, Edge-switch ID>.
Chirp packets also carry a want_reply flag to trigger a
unicast chirp in response; broadcast chirps never set this
flag. All hosts that receive a chirp update their ARP tables
with this IP → MAC address binding; they also
update the ES(m) table. SPAIN sends unicast chirps often
enough to preempt most of the flooding that would
arise from entries timing out of switch learning tables.


6.1 Host initialization 


After a host boots and initializes its NIC drivers, 
SPAIN must do some initialization. The first step is to 
download the VLAN reachability map R from a repos- 
itory. (The repository could be found via a new DHCP 
option.) While this map could be moderately large (about 
5MB for a huge network with 500K hosts and 10K edge
switches using all 4K possible VLANs), it is compress- 
ible and cachable, and since it changes rarely, a re- 
download could exploit differential update codings. 
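
The size estimate above can be checked with a short calculation. This sketch assumes the map stores one bit per (edge switch, VLAN) pair and that "4K" means 4096 VLAN IDs; the function name is hypothetical.

```python
def reachability_map_bytes(edge_switches, vlans):
    """Size of a bitmap with one bit per (edge switch, VLAN) pair."""
    bits = edge_switches * vlans
    return (bits + 7) // 8  # round up to whole bytes


size = reachability_map_bytes(edge_switches=10_000, vlans=4096)
print(size)  # 5120000 bytes, i.e. roughly 5 MB before compression
```

Note that under this layout the size depends on the number of edge switches and VLANs, not on the number of hosts.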

Next, the driver determines the ID of the edge switch 
to which it is connected, by listening for Link Layer Dis- 
covery Protocol (LLDP) messages, which switches peri- 
odically send on each port. The LLDP rate (typically, 
once per 30 sec.) is low enough to avoid significant 
end-host loads, but fast enough that a SPAIN driver that 
listens for LLDP messages in parallel with other host- 
booting steps should not suffer much delay. 

Finally, the host broadcasts a chirp packet, on the de- 


Algorithm 3 Selecting a VLAN
 1: /* determine the edge switch of the destination */
 2: m = get_dest_mac(flow)
 3: es = get_es(m)
 4: candidate_vlans = V_reach(es) /* candidate VLANs: those that reach es */
 5: if candidate_vlans is empty then
 6:   /* No candidate VLANs; */
 7:   /* Either es is on a different SPAIN cloud or m is a non-SPAIN host */
 8:   return the default VLAN (VLAN 1)
 9: /* see if any of the candidates are usable */
10: usable_vlans = candidate_vlans ∩ V_usable(es)
11: if usable_vlans is empty then
12:   return the default VLAN (VLAN 1)
13: init_probe(candidate_vlans - usable_vlans)
14: return a random v ∈ usable_vlans


fault VLAN (VLAN 1). Although broadcasts are unre- 
liable, a host Y that fails to receive the broadcast chirp 
from host X will later recover by sending a unicast chirp 
(with the wants_response flag set) when it needs to select
a VLAN for communicating with host X.


6.2 Sending a Packet 


SPAIN achieves high bisection bandwidth by spread- 
ing traffic across multiple VLANs. The SPAIN driver 
must choose which VLAN to use for each flow (we nor- 
mally avoid changing VLANs during a flow, to limit 
packet reordering). Therefore, the driver must decide 
which VLAN to use when a flow starts, and must also de- 
cide whether to change VLANs (for reasons such as rout- 
ing around a fault, or improving load-balance for long 
flows, or to support VM mobility). We divide these into 
two algorithms: for VLAN selection, and for triggering 
re-selection of the VLAN for a flow (which we call re- 
pinning). 

Algorithm 3 shows the procedure for VLAN selection
for a flow to a destination MAC m. The driver uses
the ES(m) to find the edge switch es and then uses the
reachability map R to find the set of VLANs that reach
es. If m does not appear in the ES table, then the driver
uses the default VLAN for this flow, and sends a unicast
chirp to m to determine if it is a SPAIN host. The driver
then computes the candidate set by removing VLANs
that are not in V_usable(es) (which is updated during
processing of incoming packets; see Algorithm 5).

If the candidate set is non-empty, the driver selects a 
member at random and uses this VLAN for the flow.3 If
the set is empty (there are no known-usable VLANs), the 
flow is instead assigned to VLAN 1. The driver initiates 
probing of a subset of all the VLANs that reach the es 
but are currently not usable. 
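
The selection logic can be sketched in Python as follows. This is an illustrative reading of the prose, not the SPAIN driver code: the dictionary-based tables, the function name, and the choice to return the VLANs to probe (rather than start probing directly) are assumptions. The unicast chirp sent to an unknown destination is left out.

```python
import random

DEFAULT_VLAN = 1


def select_vlan(dest_mac, es_table, reach, usable, rng=random):
    """Return (vlan, vlans_to_probe) for a new flow to dest_mac.

    es_table: MAC -> edge-switch ID        (the ES(m) table)
    reach:    edge-switch ID -> set of VLANs that reach it
    usable:   edge-switch ID -> set of VLANs recently seen usable
    """
    es = es_table.get(dest_mac)
    if es is None:
        # Unknown destination: possibly a non-SPAIN host.
        return DEFAULT_VLAN, set()
    candidates = reach.get(es, set())
    if not candidates:
        return DEFAULT_VLAN, set()
    usable_vlans = candidates & usable.get(es, set())
    # VLANs that reach es but are not currently known to be usable.
    to_probe = candidates - usable_vlans
    if not usable_vlans:
        return DEFAULT_VLAN, to_probe
    # sorted() gives rng.choice a sequence; sets are not indexable.
    return rng.choice(sorted(usable_vlans)), to_probe
```

Following the prose, the sketch also reports VLANs to probe when no VLAN is known to be usable, so that a flow falling back to VLAN 1 can still trigger rediscovery.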

SPAIN probes a VLAN v to determine whether it can 
be used to reach a given destination MAC m (or its ES) 


3 SPAIN with a dynamic centralized controller could bias this choice 
to improve global load balance; see Sec. 9. 
by sending a unicast chirp message to m on v. If the
path through VLAN v is usable and if the chirp reaches
m, the receiving SPAIN driver responds with its own unicast
chirp message on v, which in turn results in v being
marked as usable in the probing host (the bit for v in
V_usable(es) is set to 1).

When to re-pin?: Occasionally, SPAIN must change 
the VLAN assigned to a flow, or re-pin the flow. Re- 
pinning helps to solve several problems: 

1. Fault tolerance: when a VLAN fails (that is, a link 
or switch on the VLAN fails), SPAIN must rapidly 
move the flow to a usable VLAN, if one is available. 

2. VM migration: if a VM migrates to a new edge 
switch, SPAIN may have to re-assign the flow to a 
VLAN that reaches that switch. 

3. Improving load balance: in the absence of an on- 
line global controller to optimize the assignment of 
flows to VLANs, it might be useful to shift a long- 
lived flow between VLANs at intervals, so as to 
avoid pathological congestion accidents for the en- 
tire lifetime of a flow. 

4. Better VLAN probing: the re-pinning process 
causes VLAN probing, which can detect that a 
“down” VLAN has come back up, allowing SPAIN 
to exploit the revived VLAN for better load balance 
and resilience. 

When the SPAIN driver detects either of the first two 
conditions, it immediately initiates re-pinning for the af- 
fected flows. 

However, re-pinning for the last two reasons should 
not be done too frequently, since this causes problems of 
its own, especially for TCP flows: packet reordering, and 
(if re-pinning changes the available bandwidth for a flow) 
TCP slow-start effects. Hence, the SPAIN driver distin- 
guishes between TCP and non-TCP flows. For non-TCP 
flows, SPAIN attempts re-pinning at regular intervals. 

For TCP flows, re-pinning is done only to address fail- 
ure or serious performance problems. The SPAIN driver 
initiates re-pinning for these flows only when the con- 
gestion window has become quite small, and the cur- 
rent (outgoing) packet is a retransmission. Together, 
these two conditions ensure that we do not interfere with 
TCP’s own probing for available bandwidth, and also 
eliminate the possibility of packet reordering. 

Algorithm 4 illustrates the decision process for re- 
pinning a flow; it is invoked whenever the flow attempts 
to send a packet. 
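
The re-pinning decision can be sketched in Python as below. The `Flow` fields, the function name, and the default values for the re-pin interval and the congestion-window threshold are hypothetical; the text does not give concrete values for either parameter.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    is_tcp: bool
    saved_es: str                    # edge switch seen at last pinning
    current_vlan: int = -1           # negative: no VLAN chosen yet
    last_pin_time: float = 0.0
    cwnd: int = 0                    # congestion window, in segments
    is_retransmission: bool = False  # is the outgoing packet a retransmit?


def needs_vlan_selection(flow, now, current_es, last_move_time,
                         t_repin=1.0, cwnd_thresh=2):
    """Decide whether the flow must (re-)select a VLAN before sending."""
    if last_move_time >= flow.last_pin_time:
        return True   # this host moved since the last VLAN selection
    if flow.saved_es != current_es:
        flow.saved_es = current_es
        return True   # the destination moved to another edge switch
    if flow.current_vlan < 0:
        return True   # new flow
    if not flow.is_tcp:
        return (now - flow.last_pin_time) > t_repin  # periodic re-pin
    # TCP: only when the window is small and we are retransmitting.
    return flow.cwnd < cwnd_thresh and flow.is_retransmission
```

A caller would invoke this on every send and, when it returns true, run VLAN selection and record the new pin time.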

One risk of re-pinning based on decreases in the congestion
window is that it could lead to instability if many
flows are sharing a link that suddenly becomes over- 
loaded. SPAIN tries to prevent oscillations by spreading 
out the re-pinning operations. Also, pinning a flow to 
a new VLAN does not cause the original VLAN to be 
marked as unusable, so new flow arrivals could still be 
Algorithm 4 Determine if a flow needs VLAN selection
 1: if last_move_time >= last_pin_time then
 2:   /* we moved since last VLAN selection - re-pin flow */
 3:   return true;
 4: current_es = get_es(dst_mac)
 5: if saved_es (from the flow state) != current_es then
 6:   /* destination moved - update flow state & re-pin */
 7:   saved_es = current_es;
 8:   return true
 9: if current_vlan(flow) < 0 then
10:   return true /* new flows need VLAN selection */
11: if proto_of(flow) ≠ TCP then
12:   if (now - last_pin_time) > T_repin then
13:     return true /* periodic re-pin */
14: else
15:   if cwnd(flow) < W_repin_thresh && is_rxmt(flow) then
16:     return true /* TCP flow might prefer another path */
17: return false /* no need to repin */


Algorithm 5 Receiving a Packet
 1: vlan = get_vlan(packet)
 2: m = get_src_mac(packet)
 3: if is_chirp(packet) then
 4:   update_ARP_table(packet)
 5:   update_ES_table(packet, vlan)
 6:   if wants_chirp_response(packet) then
 7:     send_unicast_chirp(m, vlan)
 8: es = get_es(m) /* determine sender’s edge switch */
 9: /* mark packet-arrival VLAN as usable for es */
10: V_usable(es) = V_usable(es) ∪ {vlan}
11: /* chirp if we haven’t sent to es via vlan recently */
12: if the vlan bit in V_sent(es) is not set then
13:   send_unicast_chirp(m, vlan)
14:   /* V_sent(es) is cleared every T_sent sec. */
15: deliver packet to protocol stack


assigned to that VLAN, which should damp oscillations. 
However, we lack solid evidence that these techniques 
guarantee stability; resolving this issue is future work. 


6.3 Receiving a Packet 


Algorithm 5 shows pseudo-code for SPAIN’s packet 
reception processing. All chirp packets are processed to 
update the host’s ARP table and ES table (which maps
MAC addresses to edge switches); if the chirp packet re- 
quests a response, SPAIN replies with its own unicast 
chirp on the same VLAN. 

The driver treats any incoming packet (including 
chirps) as proof of the health of the path to its source edge 
switch es via the arrival VLAN.4 It records this observation
in the V_usable(es) bitmap, for use by Algorithm 3.

Finally, before delivering the received packet to the 
protocol stack, the SPAIN driver sends a unicast chirp 


4 In the case of an asymmetrical failure in which our host’s packets
are lost, SPAIN will ultimately declare the path dead after our peer 
gives up on the path and stops using it to send chirps to us. 
to the source host if one has not been sent recently.
(The pseudo-code omits a few details, including the case
where the mapping ES(m) is unknown. The code also
omits details of deciding which chirps should request a
chirp in response.)
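
The receive path described above can be sketched in Python. The class name, the dictionary packet format, and the list standing in for real transmissions are assumptions made for illustration. Returning early for an unknown sender is one possible handling of the case the pseudo-code omits.

```python
class SpainReceiver:
    """Toy model of the per-packet receive processing."""

    def __init__(self):
        self.arp = {}         # IP -> MAC
        self.es_table = {}    # MAC -> edge-switch ID
        self.usable = {}      # edge-switch ID -> usable VLANs
        self.sent = {}        # edge-switch ID -> VLANs recently sent on
        self.chirps_out = []  # (dest MAC, VLAN); stands in for real sends

    def send_unicast_chirp(self, mac, vlan):
        self.chirps_out.append((mac, vlan))
        es = self.es_table.get(mac)
        if es is not None:
            # Every send sets the VLAN's bit in the sent table.
            self.sent.setdefault(es, set()).add(vlan)

    def receive(self, packet):
        vlan, mac = packet["vlan"], packet["src_mac"]
        chirp = packet.get("chirp")
        if chirp is not None:
            self.arp[chirp["ip"]] = chirp["mac"]
            self.es_table[chirp["mac"]] = chirp["es"]
            if chirp.get("want_reply"):
                self.send_unicast_chirp(mac, vlan)
        es = self.es_table.get(mac)
        if es is None:
            return packet  # unknown sender: nothing to record
        # Any arrival shows the VLAN works from es to this host.
        self.usable.setdefault(es, set()).add(vlan)
        if vlan not in self.sent.get(es, set()):
            self.send_unicast_chirp(mac, vlan)
        return packet  # hand the packet to the protocol stack
```

Because `send_unicast_chirp` records the send, a chirp that requests a reply produces one response rather than two.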


6.4 Table housekeeping 


The SPAIN driver must do some housekeeping func- 
tions to maintain some of its tables. First, every time a 
packet is sent, SPAIN sets the corresponding VLAN’s bit 
in V_sent(es).

Periodically, the V_sent(es) and V_usable(es) tables
must be cleared, at intervals of T_sent seconds. To avoid
chirp storms, our driver performs these table-clearing 
steps in evenly-spaced chunks, rather than clearing the 
entire table at once. 
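
One way to clear a table in evenly spaced chunks is sketched below in Python. The text does not say how the chunks are formed, so striding over the sorted edge-switch IDs is purely an assumption, as are the class and method names.

```python
class ChunkedClearer:
    """Clears a per-edge-switch table one slice at a time."""

    def __init__(self, table, chunks):
        self.table = table      # edge-switch ID -> set of VLANs
        self.chunks = chunks    # number of slices per full sweep
        self.next_chunk = 0

    def tick(self):
        """Clear one slice; call once every (interval / chunks) seconds."""
        keys = sorted(self.table)
        for key in keys[self.next_chunk::self.chunks]:
            self.table[key].clear()
        self.next_chunk = (self.next_chunk + 1) % self.chunks
```

With `chunks` slices, each entry is still cleared once per full interval, but the chirps triggered by the cleared entries are spread across the interval.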


6.5 Support for end-host mobility 


SPAIN makes a host that moves (e.g., for VM migra- 
tion) responsible for informing all other hosts about its 
new location. In SPAIN, a VLAN is used to represent 
a collection of paths. Most failures only affect a subset 
of those paths. Hence, the usability of a VLAN to reach 
a given destination is a function of the location of the 
sender. When the sender moves, it has to re-learn this 
usability, so it flushes its usability map V_usable(es).

Also, peer end-hosts and Ethernet switches must learn 
where the host is now connected. Therefore, after a host 
has finished its migration, it broadcasts a chirp, which 
causes the recipient hosts to update their ARP and ES ta- 
bles, and which causes Ethernet switches to update their 
learning tables. 


6.6 Handling failures 


Failure detection, for a SPAIN end host, consists of 
detecting a VLAN failure and selecting a new VLAN for 
the affected flows; we have already described VLAN se- 
lection (Algorithm 3). 

While we do not have a formal proof, we believe that 
SPAIN can almost always detect that a VLAN has failed 
with respect to an edge switch es, because most failures 
result in observable symptoms, such as a lack of incoming
packets (including chirp responses) from es, or
severe losses on TCP flows to hosts on es.

SPAIN’s design improves the chances for rapid failure 
detection because it treats all received packets as probes 
(to update V_usable), and because it aggregates path-health
information per edge switch, rather than per destination 
host. However, because switch or link failures usually do 
not fully break an entire VLAN, SPAIN does not discard 
an entire VLAN upon failure detection; it just stops using 
that VLAN for the affected edge switch(es). 

SPAIN also responds rapidly to fault repairs; the re- 
ceipt of any packet from a host connected to an edge 
switch will re-establish the relevant VLAN as a valid 


choice. The SPAIN driver also initiates re-probing of a 
failed VLAN if a flow that could have used the VLAN 
is either starting or being re-pinned. At other times, the 
SPAIN driver re-probes less aggressively, to avoid un- 
necessary network overhead. 


7 How SPAIN meets its goals 


We can now summarize how the design of SPAIN ad- 
dresses the major goals we described in Sec. 4. 

Efficiently exploit multiple paths in arbitrary 
topologies: SPAIN’s use of multiple VLANs allows it 
to spread load over all physical links in the network, not 
just those on a single spanning tree. SPAIN’s use of end- 
host techniques to spread load over the available VLANs 
also contributes to this efficiency. 

Support COTS Ethernet switches: SPAIN requires 
only standard features from Ethernet switches. Also, be- 
cause SPAIN does not require routing all non-local traffic 
through a single core switch, it avoids the need for ex- 
pensive switches with high port counts or high aggregate 
bandwidths. 

Tolerate faults: SPAIN pre-computes multiple paths 
through the network, so that when a path fails, it can im- 
mediately switch flows to alternate paths. Also, by avoid- 
ing the need for expensive core switches, it decreases the 
need to replicate expensive components, or to rely on a 
single component for a large subset of paths. 

SPAIN constantly checks path quality (through ac- 
tive probing, monitoring incoming packets, and monitor- 
ing the TCP congestion window), thereby allowing it to 
rapidly detect path failures. 

Support incremental deployment: The correctness 
of SPAIN’s end-host processing does not depend on an 
assumption that all end hosts implement SPAIN. (Our 
experiments in Sec. 10.4, showing the performance of 
SPAIN in incremental deployments, did not require any 
changes to either the SPAIN code or the non-SPAIN 
hosts.) Traffic to and from non-SPAIN hosts automat- 
ically follows the default VLAN, because these hosts 
never send chirp messages and so the SPAIN hosts never 
update their ES(m) maps for these hosts.


8 Simulation results


We first evaluate SPAIN using simulations of a vari- 
ety of network topologies. Later, in Sec. 10, we will 
show experimental measurements using a specific topol- 
ogy, but simulations are the only feasible way to explore 
a broader set of network topologies and scales. 

We use simulations to (i) show how SPAIN increases
link coverage and potential reliability; (ii) quantify the
switch-resource requirements for SPAIN’s VLAN-based
approach; and (iii) show how SPAIN increases the potential
aggregate throughput for a network.

We simulated a variety of regular topologies, as de- 
Table 3: Summary of simulation results

Columns: Topology | #Switches | #Links | #Hosts | Coverage (STP, SPAIN) | NCP (STP, SPAIN) | #VLANs | Throughput gain (PS=1, PS=a)

FatTree(4)
FatTree(8)
FatTree(16)
FatTree(48)
HyperX(3)
HyperX(4)
HyperX(8)
HyperX(16)
CiscoDC(2,2)
CiscoDC(3,2)
CiscoDC(4,3)
CiscoDC(8,8) 146 361
BCube(8,2) 16 128
BCube(48,2) 4608
BCube(8,4) 16384

56.25 51.04 31.82
2304
2048 4096
100.00 100.00 100.00
1.44 1.04 1.68
0.14 0.36 0.00
23.81 22.50 43.19
2048


Key: Coverage = % of links covered by STP; NCP = % of node pairs with no connectivity, for link-failure probability = 0.04; VLANs = # VLANs
required; Throughput gain = aggregate throughput, normalized to STP, for sufficient flows to saturate the network. PS = path-set size; a = maximum
number of edge-disjoint paths between any two switches: (p/2)^2 for FatTree(p) topologies, 2k - 2 for HyperX, 3 for CiscoDC, and l for BCube.


fined in Section 5.4: FatTree (p), BCube (p, l), 2-D HyperX (k),
and CiscoDC (m, a).

Table 3 summarizes some of our simulation results 
for these topologies. We show results where SPAIN’s 
path-set size PS (the number of available paths per 
source-destination pair) is set to the maximum num- 
ber a of edge-disjoint paths possible in each topology. 
a = (p/2)^2 for the FatTree topologies, a = 2k - 2 for
HyperX, a = 3 for CiscoDC, and a = l for BCube
(l is the number of levels in a BCube(p, l) topology).
For throughput experiments, we also present results for 
PS =1. 

The Coverage column shows the fraction of links cov- 
ered by a spanning tree and SPAIN. Except for the Cis- 
coDC topologies, SPAIN always covers 100% of the 
links. In case of the CiscoDC topologies, our computed 
edge-disjoint paths do not utilize links between the pairs 
of aggregation switches, nor the link between the two 
core switches. Hence, SPAIN’s VLANs do not cover 
these links. 

The NCP (no-connectivity pairs) column is indicative 
of the fault tolerance of a SPAIN network; it shows the 
expected fraction of source-destination pairs that lack 
connectivity, with a simulated link-failure probability of 
0.04, averaged over 10 randomized trials. (These are for 
PS set to the maximum edge-disjoint paths; even for 
PS = 1, SPAIN would be somewhat more fault-tolerant 
than STP.) 

The VLANs column shows the number of VLANs re- 
quired for each topology. For all topologies considered, 
the number is below Ethernet’s 4K limit. 

The Throughput gain columns show the aggregate 
throughput achieved through the network, normalized so 
that STP = 1. We assume unit per-link capacity and fair
sharing of links between flows. We also assume that 
SPAIN chooses at random from the a available paths (for 
the PS = a column), and we report the mean of 10 ran- 
domized trials. 

Our throughput simulation is very simple: it starts by 
queueing all N flows (e.g., 1 million) spread across H 
hosts, and then measures the time until all have com- 
pleted. This models a pessimistic case where all host-to- 
host paths are fully loaded. Real-world data-center net- 
works never operate like this; the experiments in Sec. 10 
reflect a case in which only a subset of host-to-host paths
are fully loaded. For example, SPAIN’s throughput gain 
over STP for our BCube(48,2) simulations peaks at about 
10x when the number of flows is approximately the num- 
ber of hosts (this case is not shown in the table). Also, the 
simulations for SPAIN sometimes favor the PS = 1 configuration,
which avoids the congestion that is caused by
loading too many paths at once (as in the PS = a case).

In summary, SPAIN’s paths cover more than twice 
the links, and with significantly more reliability, than 
spanning-tree’s paths, and, for many topologies and 
workloads, SPAIN significantly improves throughput 
over spanning tree. 


9 Linux end-host implementation 


Our end-host implementation for Linux involves two 
components: a dynamically-loadable kernel-module 
(“driver”) that implements all data-plane functionality, 
and a user-level controller, mostly composed of shell and 
Perl scripts. 

On boot-up (or whenever SPAIN functionality needs 
to be initialized) the user-level controller first determines 
the MAC address of the network interface. It also determines
the ID of the edge-switch to which the NIC is
connected, by listening to the LLDP messages sent by 
the switch. It then contacts a central repository, via a 
pre-configured IP address; currently, we hard-code the 
IP address of the repository, but it could be supplied to 
each host via DHCP options. The controller then downloads
the reachability map V_reach, and optionally a table
that provides bias weights for choosing between VLANs
(to support traffic engineering). 

The controller then loads the SPAIN kernel driver, cre- 
ating a spain virtual Ethernet device. Next, the con- 
troller configures the spain virtual device, using the 
following three major steps. 

First, the controller attaches the spain device, as a 
master, to the underlying real eth device (as a slave). 
This master-slave relationship causes all packets that ar- 
rive over the eth device to be diverted to the spain 
device. That allows SPAIN’s chirping protocol to see all 
incoming packets before they are processed by higher- 
layer handlers. 

Second, the controller configures the spain device 
with the same IP and MAC addresses as the underly- 
ing eth device. The controller adjusts the routing table 
so that all the entries that were pointing to the eth de- 
vice are now pointing to spain. This re-routing allows 
SPAIN’s chirping protocol to see all outgoing packets. 

Third, the controller supplies the driver with the maps 
it downloaded from the central repository, via a set of 
special /proc files exposed by the driver. 

The spain driver straightforwardly implements the 
algorithms described in Section 6, while accounting for 
certain idiosyncrasies of the underlying NIC hardware. 
For instance, with NICs that do not support hardware 
acceleration for VLAN tagging on transmitted packets, 
the driver must assemble the VLAN header, insert it be- 
tween the existing Ethernet header fields according to the 
802.1Q specification, and then appropriately update the 
packet meta-data to allow the NIC to correctly compute 
the CRC field. Similarly, NICs with hardware acceler- 
ation for VLAN reception may sometimes deliver a re- 
ceived packet directly to the layer-3 protocol handlers, 
bypassing the normal driver processing. For these NICs, 
the spain driver must install an explicit packet handler 
to intercept incoming packets. (Much of this code is bor- 
rowed directly from the existing 802.1q module.) 
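
For readers unfamiliar with software VLAN tagging, the following Python sketch shows where the 4-byte 802.1Q tag goes in an untagged Ethernet frame. It operates on raw bytes rather than kernel packet buffers, so it only illustrates the header layout, not the driver code; the function name is hypothetical and the frame is assumed to exclude the trailing CRC, which the NIC computes.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that identifies an 802.1Q tag


def insert_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert an 802.1Q tag after the destination and source MACs."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits")
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # Tag control information: 3-bit priority, 1-bit DEI (0), 12-bit ID.
    tci = (priority << 13) | vlan_id
    tag = struct.pack("!HH", TPID_8021Q, tci)
    # Bytes 0-11 hold the two MAC addresses; the original EtherType
    # and payload follow the inserted tag unchanged.
    return frame[:12] + tag + frame[12:]
```

The tagged frame is four bytes longer than the original, which matters when the untagged frame is already at the link's maximum size.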

Data structures: The SPAIN driver maintains several
tables to support VLAN selection and chirping. To
save space, we only discuss a few details. First, we
maintain multiple bitmaps (for V_usable and V_sent) representing
several time windows, rather than one bitmap;
this spreads out events, such as chirping, to avoid large 
bursts of activity. Our current implementation ages out 
the stored history after about 20 seconds, which is fast 


enough to avoid FIB timeouts in the switches, without 
adding too much chirping overhead. 

We avoid the use of explicit timers (by letting packet 
events drive the timing, as in “soft timers” [11]), and the 
use of multiprocessor locks, since inconsistent updates to 
these bitmaps do not create incorrect behavior. 
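
A windowed set that ages entries without timers might look like the Python sketch below. The number of windows and the window length (4 windows of 5 seconds, chosen to age entries out within roughly 20 seconds) are assumptions, as are all names; the driver's actual layout is not described in this detail.

```python
class WindowedVlanSet:
    """VLAN membership kept in rotating time windows.

    Aging happens lazily on each access, so no timer is needed:
    the caller passes in the current time with every operation.
    """

    def __init__(self, windows=4, window_sec=5.0):
        self.windows = [set() for _ in range(windows)]
        self.window_sec = window_sec
        self.current_epoch = 0

    def _advance(self, now):
        epoch = int(now // self.window_sec)
        # Clear every window entered since the last access
        # (at most all of them).
        steps = min(epoch - self.current_epoch, len(self.windows))
        for i in range(1, steps + 1):
            slot = (self.current_epoch + i) % len(self.windows)
            self.windows[slot].clear()
        if epoch > self.current_epoch:
            self.current_epoch = epoch

    def add(self, vlan, now):
        self._advance(now)
        slot = self.current_epoch % len(self.windows)
        self.windows[slot].add(vlan)

    def contains(self, vlan, now):
        self._advance(now)
        return any(vlan in window for window in self.windows)
```

With these parameters an entry survives between 15 and 20 seconds, and only one window's worth of entries expires at a time.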

Overall, the driver maintains about 4KB of state for 
each known edge switch, which is reasonable even for 
fairly large networks. 

Limitations: Our current implementation can only 
handle one NIC per server. Data-center servers typically 
support between two and four NICs, mostly to provide 
fault tolerance. We should be able to borrow techniques 
from the existing bonding driver to support simultane- 
ous use of multiple NICs. Also, the current implemen- 
tation does not correctly handle NICs that support TCP 
offload, since this feature is specifically intended to hide 
layer-2 packets from the host software. 


10 Experimental evaluation 


In our experiments, we evaluate four aspects of 
SPAIN: overheads added by the end-host software; how 
SPAIN improves over a traditional spanning tree and 
shortest-path routing; support for incremental deploy- 
ment; and tolerance of network faults. 

We do not compare SPAIN’s performance against 
other proposed data-center network designs, such as 
PortLand [27] or VL2 [17], because these require spe- 
cific network topologies. SPAIN’s support for arbitrary 
topology is an advantage: one can evaluate or use it 
on the topology one has access to. (We plan to rebuild 
our testbed network to support fat-tree topologies, but 
this is hard to do at scale.) However, in Section 10.5, 
we compare SPAIN’s performance against shortest-path 
routing, as is done in SEATTLE [22], TRILL [7], and 
IEEE 802.1aq [5]. 


10.1 Configuration and workloads 


We conducted our evaluation on three racks that are 
part of the (larger) Open Cirrus testbed [13]. All our ex- 
periments are run on 80 servers spread across these three 
racks (rack 1 has 23 servers, rack 2 has 28, and rack 3 
has 29). These servers have quad-core 2GHz Intel Xeon 
CPUs, 8GB RAM and run Ubuntu 9.04. 

Each server is connected to a 3500-series HP 
ProCurve switch using a 1-GigE link, and these rack 
switches are connected to a central 5406-series ProCurve 
switch via 10-GigE links. 

The Open Cirrus cluster was originally wired using 
a traditional two-tiered tree, with the core 5406 switch 
(TS) connected to the 3500 switches (S1, S2, and S3) 
in each logical rack. To demonstrate SPAIN’s bene- 
fits, we added 10-GigE cross-connects between the 3500 
switches, so that each such switch is connected to another 
~28 blades per switch. Dashed lines represent the non-spanning-tree links that we added.


Figure 4: Wiring topology used in our experiments 


Panels: VLAN V1, VLAN V2, VLAN V3, VLAN V4


Figure 5: VLANs used by SPAIN for our topology 


switch in each physical rack. Fig. 4 shows the resulting 
wired topology, and Fig. 5 shows the four VLANs com- 
puted by the SPAIN offline configuration algorithms. 

In our tests, we used a “shuffle” workload (similar 
to that used in [17]), an all-to-all memory-to-memory 
bulk transfer among N participating hosts. This com-
munication pattern occurs in several important applica-
tions, such as the shuffle phase between Map and Reduce 
phases of MapReduce, and in join operations in large 
distributed databases. In our workload, each host trans-
fers 500MB to every other host using 10 simultaneous
threads; the order in which hosts choose destinations is 
randomized to avoid deterministic hot spots. With each 
machine sending 500MB to all other machines, this ex- 
periment transfers about 3.16TB. 
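The total volume follows directly from the all-to-all pattern. As a quick check, the sketch below assumes N = 80 hosts (the testbed size given above) and decimal units (1 TB = 10^6 MB); the function name is illustrative only.

```python
def shuffle_volume_tb(n_hosts: int, mb_per_pair: float) -> float:
    """Total data moved when every host sends mb_per_pair MB to every other host."""
    ordered_pairs = n_hosts * (n_hosts - 1)    # (sender, receiver) pairs, sender != receiver
    return ordered_pairs * mb_per_pair / 1e6   # MB -> TB, decimal units (assumption)

print(shuffle_volume_tb(80, 500))  # 3.16
```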


10.2 End-host overheads 


We measured end-host overheads of several con- 
figurations, using ping (100 trials) to measure la- 
tency, and NetPerf (10 seconds, 50 trials) to measure 
both uni-directional and simultaneous bi-directional TCP 
throughput between a pair of hosts. 

We found that we could not get optimal bi-directional 
TCP throughput, even for unmodified Linux in a two- 
host configuration, without using “Jumbo” (9000-byte) 
Ethernet packets. (We were able to get optimal one-way 
throughput using 1500-byte packets.) We are not entirely 
sure of the reason for this problem. The TCP experi- 
ments described in this paper all use Jumbo packets.

Table 4: End-host overheads

Configuration            ping RTT (usec)    TCP throughput (Mbit/sec), 2-way, mean [min,max]
Unmodified Linux                            1866 [1858,1872]
  1st pkt, cold start
SPAIN, no chirping                          1860 [1852,1871]
  1st pkt, cold start
SPAIN w/chirping                            1866 [1857,1876]
  1st pkt, cold start

ping results: mean of 100 warm-start trials;
throughput: mean, min, max of 50 trials






Table 4 shows the overhead measurements. The 
SPAIN driver does not appear to measurably affect TCP 
throughput or “ping” latency. Note that the chirping 
protocol does not measurably change either throughput 
or warm-start latency, even though it adds some data- 
structure updates on every packet transmission and re- 
ception. However, it appears to increase cold-start la- 
tency slightly, probably because of the CPU costs of al- 
locating and initializing some data structures. 

Table 4 shows throughputs for a single TCP flow 
in each direction. The shuffle workload, described in 
Sec. 10.1, sometimes leads to an imbalance in the num- 
ber of TCP flows entering and leaving a node. We dis- 
covered that (even in unmodified Linux) this imbalance 
can lead to a significant throughput drop for the direction 
with fewer flows. For example, with 9 flows in one direc-
tion and 1 flow in the other, the 1-flow direction only gets
274 Mbps, while the 9-flow direction gets 984 Mbps. We 
are not sure what causes this. 


10.3 SPAIN vs. spanning tree

Table 5 shows how SPAIN compares to spanning tree 
when running the shuffle workload on the Open Cirrus 
testbed. 


Table 5: Spanning-tree vs. SPAIN

                              Spanning Tree    SPAIN
Mean goodput/host (Mb/s)      449.25           834.51
Aggregate goodput (Gb/s)      35.60            66.68
Mean completion time/host     744.57 s         397.50 s
Total shuffle time            831.95 s         431.12 s

Results are means of 10 trials, 500 MBytes/trial, 80 hosts


Our SPAIN trials yielded an aggregate goodput of 
66.68 Gbps, which is 83.35% of the ideal 80-node good- 
put of 80 Gbps. This is an improvement of 87.30% over 
the spanning tree topology for the same experiment. 

Based on the two-node bidirectional TCP transfer 
measurement shown in Table 4, the SPAIN goodput 
should have been (80* 1860/2) Mbps or 74.4 Gbps. Thus 
the observed goodput is about 10% less than this ex- 
pected goodput. We suspect that the discrepancy is
the result of the decreased throughput, described in
Sec. 10.2, caused by occasional flow-count imbalances 
during these experiments. Note that we monitored the 
utilization of all links during the SPAIN experiments, and 
did not see any saturated links. 
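The derived figures in this subsection can be re-computed from Table 5 and the 2-way throughput quoted in the text. The sketch below is only an arithmetic check; the constant and function names are illustrative, and the 1 Gbps per-host ideal is taken from the 1-GigE host links described in Sec. 10.1.

```python
HOSTS = 80
LINK_GBPS = 1.0            # 1-GigE host links (Sec. 10.1)
SPAIN_GBPS = 66.68         # aggregate goodput, Table 5
STP_GBPS = 35.60           # aggregate goodput, Table 5
TWO_WAY_MBPS = 1860        # 2-way TCP throughput used in the text

def goodput_summary():
    ideal = HOSTS * LINK_GBPS                      # 80 Gbps
    expected = HOSTS * TWO_WAY_MBPS / 2 / 1000     # Mbps -> Gbps
    return {
        "fraction_of_ideal": SPAIN_GBPS / ideal,
        "gain_over_spanning_tree": SPAIN_GBPS / STP_GBPS - 1,
        "expected_gbps": expected,
        "shortfall_vs_expected": 1 - SPAIN_GBPS / expected,
    }

for name, value in goodput_summary().items():
    print(f"{name}: {value:.4f}")
```

The four values come out at roughly 0.8335, 0.8730, 74.4, and 0.104, matching the 83.35%, 87.30%, 74.4 Gbps, and "about 10%" quoted above.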


10.4 Incremental deployability 


One of the key features of SPAIN is incremental de- 
ployability. To demonstrate this, we randomly assigned 
a fraction f of hosts as SPAIN nodes, and disabled 
SPAIN on the remaining hosts. We measured the good- 
put achieved with the shuffle experiment. To account for 
variations in node placement, we ran 10 trials for each 
fraction f, doing different random node assignments for 
each trial. 


Figure 6: Incremental deployability (y-axis: normalized means; x-axis: fraction of SPAIN hosts, 0 to 1). Results are means over 10 trials.


In Fig. 6, we show how several metrics (per-host good- 
put, aggregate goodput, mean per-host completion time, 
and total shuffle run-time) vary as we change the fraction 
f of nodes on which SPAIN is deployed. The y-values 
for each curve are normalized to the 0%-SPAIN results. 
We saw very little trial-to-trial variation (less than 2.5%) 
in these experiments, with the exception of total shuffle 
time, which varies up to 13% between trials. This varia- 
tion does not seem to depend on the use of SPAIN. 

As expected, the aggregate goodput increases and
the mean completion time decreases as the fraction of
SPAIN nodes increases. The aggregate goodput curve
rises until about f = 0.9 and then flattens, because at
that point none of the links in our network is bot-
tlenecked. Hence, at f = 0.9, even flows from or to
non-SPAIN nodes do not experience any congestion.


10.5 SPAIN vs. Shortest-Path Routing 


Protocols such as SEATTLE [22], TRILL [7], and
IEEE 802.1aq [5] improve over the spanning-tree pro-
tocol by using Shortest-Path Routing (SPR). Although
we did not test SPAIN directly against those three pro-
tocols, we can compare SPAIN’s performance to SPR-
based paths by emulation: we restrict the paths employed
by SPAIN to only those that use the shortest paths be-
tween the switches in our test network. In this network,
as shown in Fig. 4, the shortest paths between switches
S1, S2, and S3 do not go through the core switch (TS);
i.e., they do not include the links shown as “VLAN V1”
in Fig. 5. (TRILL supports equal-cost multipath, but this
would be hard to apply to the topology of Fig. 4.)

Figure 7: Fault-tolerance experiment

We then re-ran the shuffle experiment using the SPR 
topology, and achieved an aggregate goodput of 62.28 
Gbps, vs. SPAIN’s 66.61 Gbps goodput. The total shuf- 
fle time for SPR is 512.73 s, vs. SPAIN’s 430 s. As 
mentioned in Sec. 10.3, SPAIN’s throughput is limited 
by CPU overheads, so the relatively minor improvement 
of SPAIN over SPR may be a result of these overheads. 

We note that, regardless of the relative performance 
of SPAIN and SPR, SPAIN retains the advantage of be- 
ing deployable without any changes to switches. TRILL, 
SEATTLE, and IEEE 802.1aq Shortest Path Bridging
will all require switch upgrades. 


10.6 Fault tolerance 


We implemented a simple fault detection and repair 
module that runs at user-level, periodically (100 msec) 
monitoring the performance of flows. It detects that a 
VLAN has failed for a destination if the throughput drops 
by more than 87.5% (equivalent to three halvings of the 
congestion window), in which case it re-pins the flow to 
an alternate VLAN. 
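The detection rule can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the reference throughput against which the drop is measured, the function names, and the "first other VLAN" re-pinning policy are all assumptions made for the example.

```python
def drop_threshold(halvings: int = 3) -> float:
    """Fractional drop equivalent to `halvings` halvings: 1 - (1/2)^3 = 0.875."""
    return 1.0 - 0.5 ** halvings

def vlan_failed(reference_mbps: float, current_mbps: float, halvings: int = 3) -> bool:
    """True if throughput fell by more than the threshold relative to a reference
    sample (assumption: the reference is a recent healthy measurement)."""
    if reference_mbps <= 0:
        return False
    drop = (reference_mbps - current_mbps) / reference_mbps
    return drop > drop_threshold(halvings)

def repin(current_vlan: str, usable_vlans: list) -> str:
    """Pick an alternate VLAN (hypothetical policy: first usable VLAN other than the current one)."""
    alternates = [v for v in usable_vlans if v != current_vlan]
    return alternates[0] if alternates else current_vlan

# Sampled every 100 msec: 900 Mbps falling to 100 Mbps is an 88.9% drop, so re-pin.
if vlan_failed(900, 100):
    print(repin("V2", ["V1", "V2", "V3", "V4"]))  # V1
```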

To demonstrate fault-tolerance in SPAIN, we ran a 
simple experiment. We used NetPerf to generate a 50- 
second TCP flow, and measured its throughput every 100 
msec. Fig. 7 shows a partial time-line. At 29.3 sec., we 
removed a link that was in use by this connection. SPAIN 
detects the failure and repairs the end-to-end path; the 
TCP throughput returns to normal within 200-300 msec. 


11 Summary and conclusions 


Our goal for SPAIN was to provide multipath forward- 
ing using inexpensive, COTS Ethernet switches, over ar- 
bitrary topologies, and support incremental deployment. 
We have demonstrated, both in simulations and in ex- 
periments, that SPAIN meets those goals. In particular, 
SPAIN improves aggregate goodput over spanning-tree 
by 87% on a testbed that would not support most other 
scalable-Ethernet designs.

We recognize that significant additional work could
be required to put SPAIN into practice in a large-scale 
network. This work includes the design and implemen- 
tation of a real-time central controller, to support dy- 
namic global re-balancing of link utilizations, and also 
improvements to SPAIN’s end-host mechanisms for as- 
signing flows to VLANs. We also do not fully understand 
how SPAIN will affect broadcast loads in very large net- 
works.


Abstract


Today’s data centers offer tremendous aggregate band- 
width to clusters of tens of thousands of machines. 
However, because of limited port densities in even the 
highest-end switches, data center topologies typically 
consist of multi-rooted trees with many equal-cost paths 
between any given pair of hosts. Existing IP multi- 
pathing protocols usually rely on per-flow static hashing 
and can cause substantial bandwidth losses due to long- 
term collisions. 

In this paper, we present Hedera, a scalable, dy- 
namic flow scheduling system that adaptively schedules 
a multi-stage switching fabric to efficiently utilize aggre- 
gate network resources. We describe our implementation 
using commodity switches and unmodified hosts, and 
show that for a simulated 8,192 host data center, Hedera 
delivers bisection bandwidth that is 96% of optimal and 
up to 113% better than static load-balancing methods. 


1 Introduction 


At a rate and scale unforeseen just a few years ago, large
organizations are building enormous data centers that 
support tens of thousands of machines; others are mov- 
ing their computation, storage, and operations to cloud- 
computing hosting providers. Many applications—from 
commodity application hosting to scientific computing to 
web search and MapReduce—require substantial intra- 
cluster bandwidth. As data centers and their applications 
continue to scale, scaling the capacity of the network fab- 
ric for potential all-to-all communication presents a par- 
ticular challenge. 

There are several properties of cloud-based applica- 
tions that make the problem of data center network de- 
sign difficult. First, data center workloads are a priori 
unknown to the network designer and will likely be vari- 
able over both time and space. As a result, static resource 
allocation is insufficient. Second, customers wish to run
their software on commodity operating systems; there-
fore, the network must deliver high bandwidth without 
requiring software or protocol changes. Third, virtualiza- 
tion technology—commonly used by cloud-based host- 
ing providers to efficiently multiplex customers across 
physical machines—makes it difficult for customers to 
have guarantees that virtualized instances of applications 
run on the same physical rack. Without this physical lo- 
cality, applications face inter-rack network bottlenecks in 
traditional data center topologies [2]. 

Applications alone are not to blame. The routing and 
forwarding protocols used in data centers were designed 
for very specific deployment settings. Traditionally, in 
ordinary enterprise/intranet environments, communica- 
tion patterns are relatively predictable with a modest 
number of popular communication targets. There are 
typically only a handful of paths between hosts and sec- 
ondary paths are used primarily for fault tolerance. In 
contrast, recent data center designs rely on the path mul- 
tiplicity to achieve horizontal scaling of hosts [3, 16, 17, 
19, 18]. For these reasons, data center topologies are 
very different from typical enterprise networks. 

Some data center applications often initiate connec- 
tions between a diverse range of hosts and require signif- 
icant aggregate bandwidth. Because of limited port den- 
sities in the highest-end commercial switches, data cen- 
ter topologies often take the form of a multi-rooted tree 
with higher-speed links but decreasing aggregate band- 
width moving up the hierarchy [2]. These multi-rooted 
trees have many paths between all pairs of hosts. A key 
challenge is to simultaneously and dynamically forward 
flows along these paths to minimize/reduce link oversub- 
scription and to deliver acceptable aggregate bandwidth. 

Unfortunately, existing network forwarding proto- 
cols are optimized to select a single path for each 
source/destination pair in the absence of failures. Such 
static single-path forwarding can significantly underuti- 
lize multi-rooted trees with any fanout. State of the art 
forwarding in enterprise and data center environments
uses ECMP [21] (Equal Cost Multipath) to statically
stripe flows across available paths using flow hashing. 
This static mapping of flows to paths does not account 
for either current network utilization or flow size, with 
resulting collisions overwhelming switch buffers and de- 
grading overall switch utilization. 

This paper presents Hedera, a dynamic flow schedul- 
ing system for multi-stage switch topologies found in 
data centers. Hedera collects flow information from 
constituent switches, computes non-conflicting paths for 
flows, and instructs switches to re-route traffic accord- 
ingly. Our goal is to maximize aggregate network 
utilization—bisection bandwidth—and to do so with 
minimal scheduler overhead or impact on active flows. 
By taking a global view of routing and traffic demands, 
we enable the scheduling system to see bottlenecks that 
switch-local schedulers cannot. 

We have completed a full implementation of Hedera 
on the PortLand testbed [29]. For both our implementa- 
tion and large-scale simulations, our algorithms deliver 
performance that is within a few percent of optimal—a 
hypothetical non-blocking switch—for numerous inter- 
esting and realistic communication patterns, and deliver 
in our testbed up to 4X more bandwidth than state of 
the art ECMP techniques. Hedera delivers these band- 
width improvements with modest control and computa- 
tion overhead. 

One requirement for our placement algorithms is an 
accurate view of the demand of individual flows under 
ideal conditions. Unfortunately, due to constraints at the 
end host or elsewhere in the network, measuring current 
TCP flow bandwidth may have no relation to the band- 
width the flow could achieve with appropriate schedul- 
ing. Thus, we present an efficient algorithm to estimate 
idealized bandwidth share that each flow would achieve 
under max-min fair resource allocation, and describe 
how this algorithm assists in the design of our scheduling 
techniques. 


2 Background 


The recent development of powerful distributed comput- 
ing frameworks such as MapReduce [8], Hadoop [1] and 
Dryad [22] as well as web services such as search, e- 
commerce, and social networking have led to the con- 
struction of massive computing clusters composed of 
commodity-class PCs. Simultaneously, we have wit- 
nessed unprecedented growth in the size and complex- 
ity of datasets, up to several petabytes, stored on tens of 
thousands of machines [14]. 

Figure 1: A common multi-rooted hierarchical tree.

These cluster applications can often be bottlenecked
on the network, not by local resources [4, 7, 9, 14, 16].
Hence, improving application performance may hinge
on improving network performance. Most traditional
data center network topologies are hierarchical trees
with small, cheap edge switches connected to the end- 
hosts [2]. Such networks are interconnected by two or 
three layers of switches to overcome limitations in port 
densities available from commercial switches. With the 
push to build larger data centers encompassing tens of 
thousands of machines, recent research advocates the 
horizontal (rather than vertical) expansion of data cen-
ter networks [3, 16, 17]; instead of using expensive
core routers with higher speeds and port-densities, net- 
works will leverage a larger number of parallel paths be- 
tween any given source and destination edge switches, 
so-called multi-rooted tree topologies (e.g. Figure 1). 
Thus we find ourselves at an impasse—with network 
designs using multi-rooted topologies that have the po- 
tential to deliver full bisection bandwidth among all com- 
municating hosts, but without an efficient protocol to for- 
ward data within the network or a scheduler to appro- 
priately allocate flows to paths to take advantage of this 
high degree of parallelism. To resolve these problems we 
present the architecture of Hedera, a system that exploits 
path diversity in data center topologies to enable near- 
ideal bisection bandwidth for a range of traffic patterns. 


2.1 Data Center Traffic Patterns 


Currently, since no data center traffic traces are publicly 
available due to privacy and security concerns, we gen- 
erate patterns along the lines of traffic distributions in 
published work to emulate typical data center workloads 
for evaluating our techniques. We also create synthetic 
communication patterns likely to stress data center net- 
works. Recent data center traffic studies [4, 16, 24] show 
tremendous variation in the communication matrix over 
space and time; a typical server exhibits many small, 
transactional-type RPC flows (e.g. search results), as 
well as few large transfers (e.g. backups, backend op- 
erations such as MapReduce jobs). We believe that the 
network fabric should be robust to a range of commu- 
nication patterns and that application developers should 
not be forced to match their communication patterns to
what may achieve good performance in a particular net-
work setting, both to minimize development and debug-
ging time and to enable easy porting from one network
environment to another.

Figure 2: Examples of ECMP collisions resulting in reduced bisection bandwidth. Unused links omitted for clarity.

Therefore we focus in this paper on generating traffic 
patterns that stress and saturate the network, and com- 
paring the performance of Hedera to current hash-based 
multipath forwarding schemes. 


2.2 Current Data Center Multipathing 


To take advantage of multiple paths in data center topolo- 
gies, the current state of the art is to use Equal-Cost 
Multi-Path forwarding (ECMP) [2]. ECMP-enabled 
switches are configured with several possible forwarding 
paths for a given subnet. When a packet with multiple 
candidate paths arrives, it is forwarded on the one that 
corresponds to a hash of selected fields of that packet’s 
headers modulo the number of paths [21], splitting load 
to each subnet across multiple paths. This way, a flow’s 
packets all take the same path, and their arrival order is 
maintained (TCP’s performance is significantly reduced 
when packet reordering occurs because it interprets that 
as a sign of packet loss due to network congestion). 
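The hash-and-modulo selection described above can be illustrated with a short sketch. It assumes a 5-tuple flow key and uses CRC32 as a stand-in hash; real switches use their own (often proprietary) hash functions and field selections, so the names and choices here are illustrative only.

```python
import zlib
from collections import Counter

def ecmp_path(flow: tuple, n_paths: int) -> int:
    """Index of the equal-cost path chosen for a flow: hash(header fields) mod n_paths."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths

# Five flows (src IP, dst IP, protocol, src port, dst port) over four equal-cost paths.
flows = [("10.0.0.1", "10.2.0.1", 6, 40000 + i, 80) for i in range(5)]
choices = [ecmp_path(f, 4) for f in flows]

# Every packet of a flow hashes identically, so ordering within the flow is preserved.
assert ecmp_path(flows[0], 4) == ecmp_path(flows[0], 4)

# With more flows than paths, at least two flows must share a path.
shared = [path for path, count in Counter(choices).items() if count > 1]
print(choices, shared)
```

Because the mapping ignores flow size and current link load, two large flows that happen to share a path stay there for their lifetime.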

A closely-related method is Valiant Load Balancing 
(VLB) [16, 17, 34], which essentially guarantees equal- 
spread load-balancing in a mesh network by bouncing 
individual packets from a source switch in the mesh off 
of randomly chosen intermediate “core” switches, which 
finally forward those packets to their destination switch. 
Recent realizations of VLB [16] perform randomized 
forwarding on a per-flow rather than on a per-packet ba- 
sis to preserve packet ordering. Note that per-flow VLB 
becomes effectively equivalent to ECMP. 

A key limitation of ECMP is that two or more large,
long-lived flows can collide on their hash and end up on
the same output port, creating an avoidable bottleneck as
illustrated in Figure 2. Here, we consider a sample com-
munication pattern among a subset of hosts in a multi-
rooted, 1 Gbps network topology. We identify two types
of collisions caused by hashing. First, TCP flows A and
B interfere locally at switch Agg0 due to a hash collision
and are capped by the outgoing link’s 1Gbps capacity to
Core0. Second, with downstream interference, Agg1 and
Agg2 forward packets independently and cannot foresee
the collision at Core2 for flows C and D.

In this example, all four TCP flows could have reached
capacities of 1Gbps with improved forwarding; flow
A could have been forwarded to Core1, and flow D
could have been forwarded to Core3. But due to these
collisions, all four flows are bottlenecked at a rate of
500Mbps each, a 50% bisection bandwidth loss.





Figure 3: Example of ECMP bisection bandwidth losses vs.
number of TCP flows per host for a k=48 fat-tree (y-axis: loss in bisection bandwidth, % from ideal; x-axis: flows per host).


Note that the performance of ECMP and flow-based 
VLB intrinsically depends on flow size and the num- 
ber of flows per host. Hash-based forwarding performs 
well in cases where hosts in the network perform all-to- 
all communication with one another simultaneously, or 
with individual flows that last only a few RTTs. Non- 
uniform communication patterns, especially those in- 
volving transfers of large blocks of data, require more 
careful scheduling of flows to avoid network bottlenecks.

We defer a full evaluation of these trade-offs to Sec-
tion 6; however, we can capture the intuition behind
performance reduction of hashing with a simple Monte 
Carlo simulation. Consider a 3-stage fat-tree composed 
of 1GigE 48-port switches, with 27k hosts performing 
a data shuffle. Flows are hashed onto paths and each 
link is capped at 1GigE. If each host transfers an equal 
amount of data to all remote hosts one at a time, hash 
collisions will reduce the network’s bisection bandwidth 
by an average of 60.8% (Figure 3). However, if each host 
communicates to remote hosts in parallel across 1,000 si- 
multaneous flows, hash collisions will only reduce total 
bisection bandwidth by 2.5%. The intuition here is that if 
there are many simultaneous flows from each host, their 
individual rates will be small and collisions will not be 
significantly costly: each link has 1,000 slots to fill and 
performance will only degrade if substantially more than 
1,000 flows hash to the same link. Overall, Hedera com- 
plements ECMP, supplementing default ECMP behavior 
for communication patterns that cause ECMP problems. 
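The intuition can be reproduced with a much simpler model than the one behind Figure 3. The sketch below is a single-stage "balls into bins" simplification, not the 3-stage fat-tree simulation described above: each host offers one link's worth of traffic split evenly over its flows, every flow is hashed uniformly onto one of as many unit-capacity links as there are hosts, and any load above a link's capacity is lost. Its numbers therefore differ from the 60.8% and 2.5% quoted in the text; only the trend is meant to carry over. All names and parameters are illustrative.

```python
import random

def hashing_loss(n_hosts: int, flows_per_host: int, trials: int = 20, seed: int = 0) -> float:
    """Mean fraction of offered bandwidth lost to hash collisions (single-stage model)."""
    rng = random.Random(seed)
    rate = 1.0 / flows_per_host            # each host offers 1.0 in total
    total_loss = 0.0
    for _ in range(trials):
        load = [0.0] * n_hosts             # one unit-capacity link per host
        for _ in range(n_hosts * flows_per_host):
            load[rng.randrange(n_hosts)] += rate   # uniform hash onto a link
        delivered = sum(min(1.0, l) for l in load) # links cap at capacity 1.0
        total_loss += 1.0 - delivered / n_hosts
    return total_loss / trials

one_flow = hashing_loss(500, 1)
many_flows = hashing_loss(500, 50, trials=5)
print(f"1 flow/host: {one_flow:.3f}   50 flows/host: {many_flows:.3f}")
```

With one flow per host, a large share of links sit idle while others carry several colliding flows; with many small flows per host the per-link load concentrates around capacity and the loss shrinks.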


2.3 Dynamic Flow Demand Estimation


Figure 2 illustrates another important requirement for 
any dynamic network scheduling mechanism. The 
straightforward approach to find a good network-wide 
schedule is to measure the utilization of all links in the 
network and move flows from highly-utilized links to 
less utilized links. The key question becomes which 
flows to move. Again, the straightforward approach is to 
measure the bandwidth consumed by each flow on con- 
strained links and move a flow to an alternate path with 
sufficient capacity for that flow. Unfortunately, a flow’s 
current bandwidth may not reflect actual demand. We 
define a TCP flow’s natural demand to mean the rate it 
would grow to in a fully non-blocking network, such that 
eventually it becomes limited by either the sender or re- 
ceiver NIC speed. For example, in Figure 2, all flows 
communicate at 500Mbps, though all could communi- 
cate at 1Gbps with better forwarding. In Section 4.2, we 
show how to efficiently estimate the natural demands of 
flows to better inform Hedera’s placement algorithms. 


3 Architecture 


At a high level, Hedera has a control loop of
three basic steps. First, it detects large flows at the edge 
switches. Next, it estimates the natural demand of large 
flows and uses placement algorithms to compute good 
paths for them. And finally, these paths are installed on 
the switches. We designed Hedera to support any general 
multi-rooted tree topology, such as the one in Figure 1, 
and in Section 5 we show our physical implementation 
using a fat-tree topology.


3.1 Switch Initialization


To take advantage of the path diversity in multi-rooted 
trees, we must spread outgoing traffic to or from any host 
as evenly as possible among all the core switches. There- 
fore, in our system, a packet’s path is non-deterministic 
and chosen on its way up to the core, and is deterministic 
returning from the core switches to its destination edge 
switch. Specifically, for multi-rooted topologies, there 
is exactly one active minimum-cost path from any given 
core switch to any destination host. 


To enforce this determinism on the downward path, 
we initialize core switches with the prefixes for the IP 
address ranges of destination pods. A pod is any sub- 
grouping down from the core switches (in our fat-tree 
testbed, it is a complete bipartite graph of aggregation 
and edge switches, see Figure 8). Similarly, we initialize 
aggregation switches with prefixes for downward ports 
of the edge switches in that pod. Finally, edge switches 
forward packets directly to their connected hosts. 


When a new flow starts, the default switch behavior 
is to forward it based on a hash on the flow’s 10-tuple 
along one of its equal-cost paths (similar to ECMP). This 
path is used until the flow grows past a threshold rate, at 
which point Hedera dynamically calculates an appropri- 
ate placement for it. Therefore, all flows are assumed to 
be small until they grow beyond a threshold, 100 Mbps 
in our implementation (10% of each host’s 1GigE link). 
Flows are packet streams with the same 10-tuple of <src 
MAC, dst MAC, src IP, dst IP, EtherType, IP protocol, 
TCP src port, dst port, VLAN tag, input port>. 
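As a minimal sketch of this classification step, the code below models the 10-tuple as a named tuple and flags flows whose measured rate exceeds 10% of a 1GigE link. The field names, the `rates` dictionary of per-flow measurements, and the helper names are hypothetical; the paper does not specify how rates are collected at this point.

```python
from typing import NamedTuple

class FlowKey(NamedTuple):
    """The 10-tuple that identifies a flow (field names are illustrative)."""
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    ethertype: int
    ip_proto: int
    src_port: int
    dst_port: int
    vlan: int
    in_port: int

LINK_BPS = 1_000_000_000      # 1GigE host link
THRESHOLD_FRACTION = 0.10     # 10% of the link, i.e. 100 Mbps

def is_large(rate_bps: float, link_bps: float = LINK_BPS) -> bool:
    """A flow is treated as small until its rate grows past the threshold."""
    return rate_bps > THRESHOLD_FRACTION * link_bps

def large_flows(rates: dict) -> list:
    """Flows (by key) that should be handed to the scheduler for placement."""
    return [key for key, rate in rates.items() if is_large(rate)]

a = FlowKey("aa:aa", "bb:bb", "10.0.0.2", "10.1.0.2", 0x0800, 6, 5001, 80, 0, 1)
b = a._replace(src_port=5002)
print(large_flows({a: 150e6, b: 50e6}) == [a])  # True
```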


3.2 Scheduler Design 


A central scheduler, possibly replicated for fail-over and 
scalability, manipulates the forwarding tables of the edge 
and aggregation switches dynamically, based on regu- 
lar updates of current network-wide communication de- 
mands. The scheduler aims to assign flows to non- 
conflicting paths; more specifically, it tries to not place 
multiple flows on a link that cannot accommodate their 
combined natural bandwidth demands. 


In this model, whenever a flow persists for some time 
and its bandwidth demand grows beyond a defined limit, 
we assign it a path using one of the scheduling algorithms 
described in Section 4. Depending on this chosen path, 
the scheduler inserts flow entries into the edge and ag- 
gregation switches of the source pod for that flow; these 
entries redirect the flow on its newly chosen path. The 
flow entries expire after a timeout once the flow termi- 
nates. Note that the state maintained by the scheduler is 
only soft-state and does not have to be synchronized with 
any replicas to handle failures. Scheduler state is not re-
quired for correctness (connectivity); rather, it serves as a
performance optimization.

Figure 4: An example of estimating demands in a network of 4 hosts. Each matrix element denotes demand per flow as a fraction of the NIC
bandwidth. Subscripts denote the number of flows from that source (rows) to destination (columns). Entries in parentheses are yet to converge.
Grayed out entries in square brackets have converged.

Of course, the choice of the specific scheduling algo- 
rithm is open. In this paper, we compare two algorithms, 
Global First Fit and Simulated Annealing, to ECMP. 
Both algorithms search for flow-to-core mappings with 
the objective of increasing the aggregate bisection band- 
width for current communication patterns, supplement- 
ing default ECMP forwarding for large flows. 


4 Estimation and Scheduling 


Finding flow routes in a general network while not ex- 
ceeding the capacity of any link is called the MULTI- 
COMMODITY FLOW problem, which is NP-complete for 
integer flows [11]. And while simultaneous flow routing 
is solvable in polynomial time for 3-stage Clos networks, 
no polynomial time algorithm is known for 5-stage Clos 
networks (i.e. 3-tier fat-trees) [20]. Since we do not
aim to optimize Hedera for a specific topology, this pa- 
per presents practical heuristics that can be applied to a 
range of realistic data center topologies. 


4.1 Host- vs. Network-Limited Flows 


Flows can be classified into two categories: network-
limited (e.g. data transfer from RAM) and host-limited 
(e.g. limited by host disk access, processing, etc.). A 
network-limited flow will use all bandwidth available 
to it along its assigned path. Such a flow is limited 
by congestion in the network, not at the host NIC. A 
host-limited flow can theoretically achieve a maximum 
throughput limited by the “slower” of the source and des- 
tination hosts. In the case of non-optimal scheduling, 
a network-limited flow might achieve a bandwidth less 
than the maximum possible bandwidth available from the 
underlying topology. In this paper, we focus on network- 
limited flows, since host-limited flows are a symptom of 
intra-machine bottlenecks, which are beyond the scope 
of this paper. 


4.2 Demand Estimation 


A TCP flow’s current sending rate says little about its 
natural bandwidth demand in an ideal non-blocking net-
work (Section 2.3). Therefore, to make intelligent flow
placement decisions, we need to know the flows’ max- 
min fair bandwidth allocation as if they are limited only 
by the sender or receiver NIC. When network limited, 
a sender will try to distribute its available bandwidth 
fairly among all its outgoing flows. TCP’s AIMD be- 
havior combined with fair queueing in the network tries 
to achieve max-min fairness. Note that when there are 
multiple flows from a host A to another host B, each of 
the flows will have the same steady state demand. We 
now describe how to find TCP demands in a hypotheti- 
cal equilibrium state. 


The input to the demand estimator is the set F of
source and destination pairs for all active large flows.
The estimator maintains an N × N matrix M; N is the
number of hosts. The element in the i-th row, j-th column
contains 3 values: (1) the number of flows from host i
to host j, (2) the estimated demand of each of the flows
from host i to host j, and (3) a “converged” flag that
marks flows whose demands have converged.


The demand estimator performs repeated iterations of 
increasing the flow capacities from the sources and de- 
creasing exceeded capacity at the receivers until the flow 
capacities converge; Figure 7 presents the pseudocode. 
Note that in each iteration of decreasing flow capacities 
at the receivers, one or more flows converge until even- 
tually all flows converge to the natural demands. The 
estimation time complexity is O(|F|).


Figure 4 illustrates the process of estimating flow de-
mands with a simple example. Consider 4 hosts (H0, H1,
H2 and H3) connected by a non-blocking topology. Sup-
pose H0 sends 1 flow each to H1, H2 and H3; H1 sends
2 flows to H0 and 1 flow to H2; H2 sends 1 flow each
to H0 and H3; and H3 sends 2 flows to H1. The figure
shows the iterations of the demand estimator. The matri- 
ces indicate the flow demands during successive stages 
of the algorithm starting with an increase in flow capac- 
ity from the sender followed by a decrease in flow capac- 
ity at the receiver and so on. The last matrix indicates the 
final estimated natural demands of the flows. 
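The alternating source/receiver iterations can be sketched as follows. This is an illustrative reading of the procedure described above, not the paper's pseudocode: the dictionary layout, the function names, and the guard for sources whose flows have all converged are assumptions. Exact fractions are used so the example values come out cleanly.

```python
from fractions import Fraction

def estimate_demands(flow_counts):
    """flow_counts: {(src, dst): number of flows}. Returns {(src, dst): per-flow demand}."""
    # Each entry is [count, demand, converged], the three values kept per matrix element.
    m = {pair: [n, Fraction(0), False] for pair, n in flow_counts.items() if n > 0}
    hosts = sorted({h for pair in m for h in pair})
    changed = True
    while changed:                       # repeat until no demand changes
        changed = False
        for h in hosts:
            changed |= _est_src(m, h)    # senders raise unconverged flows
        for h in hosts:
            changed |= _est_dst(m, h)    # oversubscribed receivers cut flows back
    return {pair: e[1] for pair, e in m.items()}

def _est_src(m, src):
    entries = [e for (s, _), e in m.items() if s == src]
    fixed = sum((e[0] * e[1] for e in entries if e[2]), Fraction(0))
    n_open = sum(e[0] for e in entries if not e[2])
    if n_open == 0:                      # guard (assumption): nothing left to distribute
        return False
    share = (1 - fixed) / n_open         # split remaining NIC bandwidth equally
    changed = False
    for e in entries:
        if not e[2] and e[1] != share:
            e[1], changed = share, True
    return changed

def _est_dst(m, dst):
    entries = [e for (_, d), e in m.items() if d == dst]
    if sum((e[0] * e[1] for e in entries), Fraction(0)) <= 1:
        return False                     # receiver is not oversubscribed
    limited, small = list(entries), Fraction(0)
    share = Fraction(1, sum(e[0] for e in limited))
    while True:                          # flows below the equal share keep their demand
        below = [e for e in limited if e[1] < share]
        if not below:
            break
        small += sum((e[0] * e[1] for e in below), Fraction(0))
        limited = [e for e in limited if e[1] >= share]
        share = (1 - small) / sum(e[0] for e in limited)
    changed = any(e[1] != share for e in limited)
    for e in limited:                    # receiver-limited flows converge to the share
        e[1], e[2] = share, True
    return changed

# The 4-host example from the text.
example = {(0, 1): 1, (0, 2): 1, (0, 3): 1, (1, 0): 2, (1, 2): 1,
           (2, 0): 1, (2, 3): 1, (3, 1): 2}
demands = estimate_demands(example)
print(demands[(2, 3)], demands[(1, 0)])  # 2/3 1/3
```

Under this sketch, every flow in the example settles at 1/3 of the NIC bandwidth except the flow from H2 to H3, which picks up the capacity H2 cannot use toward the oversubscribed H0 and settles at 2/3.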


For real communication patterns, the demand matrix
for currently active flows is a sparse matrix since most
hosts will be communicating with a small subset of re-
mote hosts at a time. The demand estimator is also
largely parallelizable, facilitating scalability. In fact, our
implementation uses both parallelism and sparse matrix
data structures to improve the performance and memory
footprint of the algorithm.


GLOBAL-FIRST-FIT(f: flow)
    if f.assigned then
        return old path assignment for f
    foreach p ∈ P_{src→dst} do
        if p.used + f.rate < p.capacity then
            p.used ← p.used + f.rate
            return p
    else
        h = HASH(f)
        return p = P_{src→dst}(h)

Figure 5: Pseudocode for Global First Fit. GLOBAL-FIRST-FIT is called for each flow in the system.


4.3 Global First Fit


In a multi-rooted tree topology, there are several possible 
equal-cost paths between any pair of source and destination
hosts. When a new large flow is detected (e.g.,
10% of the host’s link capacity), the scheduler linearly
searches all possible paths to find one whose link com- 
ponents can all accommodate that flow. If such a path 
is found, then that flow is “placed” on that path: First, 
a capacity reservation is made for that flow on the links 
corresponding to the path. Second, the scheduler creates 
forwarding entries in the corresponding edge and aggre- 
gation switches. To do so, the scheduler maintains the 
reserved capacity on every link in the network and uses 
that to determine which paths are available to carry new 
flows. Reservations are cleared when flows expire. 

Note that this corresponds to a first fit algorithm; a 
flow is greedily assigned the first path that can accom- 
modate it. When the network is lightly loaded, find- 
ing such a path among the many possible paths is likely 
to be easy; however, as the network load increases and 
links become saturated, this choice becomes more diffi- 
cult. Global First Fit does not guarantee that all flows 
will be accommodated, but this algorithm performs rel- 
atively well in practice as shown in Section 6. We show 
the pseudocode for Global First Fit in Figure 5. 
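
As a concrete illustration, the placement logic can be sketched in Python. This is a minimal sketch assuming a uniform link capacity and paths given as tuples of link identifiers; the `Path` and `Scheduler` classes, the CRC32 fallback hash, and the `expire` helper are hypothetical names introduced here, not part of Hedera.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Path:
    links: tuple    # identifiers of the links along the path (illustrative)


@dataclass
class Scheduler:
    capacity: float                                # uniform link capacity
    reserved: dict = field(default_factory=dict)   # link -> reserved rate
    placed: dict = field(default_factory=dict)     # flow id -> Path

    def fits(self, path, rate):
        """True if every link on the path can still carry `rate`."""
        return all(self.reserved.get(l, 0.0) + rate < self.capacity
                   for l in path.links)

    def global_first_fit(self, flow_id, rate, candidates):
        """Return the path for flow_id, reserving capacity on the first fit."""
        if flow_id in self.placed:
            return self.placed[flow_id]
        for path in candidates:                    # linear search
            if self.fits(path, rate):
                for l in path.links:
                    self.reserved[l] = self.reserved.get(l, 0.0) + rate
                self.placed[flow_id] = path
                return path
        # No path can hold the flow: fall back to a hash-selected path.
        h = zlib.crc32(repr(flow_id).encode())
        return candidates[h % len(candidates)]

    def expire(self, flow_id, rate):
        """Clear the reservation of a finished flow."""
        path = self.placed.pop(flow_id, None)
        if path is not None:
            for l in path.links:
                self.reserved[l] -= rate
```

Because reservations are kept per link rather than per path, two candidate paths that share a link compete for the same capacity, which matches the observation that equal-cost paths are not link-disjoint.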


4.4 Simulated Annealing 


Next we describe the Simulated Annealing scheduler,
which performs a probabilistic search to efficiently compute
paths for flows. The key insight of our approach is
to assign a single core switch for each destination host
rather than a core switch for each flow. This reduces
the search space significantly. Simulated Annealing forwards
all flows destined to a particular host A through
the designated core switch for host A.


SIMULATED-ANNEALING(n: iteration count)

Figure 6: Pseudocode for Simulated Annealing. s denotes
the current state with energy E(s) = e. e_B denotes the best
energy seen so far in state s_B. T denotes the temperature. e_N
is the energy of a neighboring state s_N.

The input to the algorithm is the set of all large flows 
to be placed, and their flow demands as estimated by 
the demand estimator. Simulated Annealing searches 
through a solution state space to find a near-optimal solution
(Figure 6). A function E defines the energy in the
current state. In each iteration, we move to a neighboring
state with a certain acceptance probability P, depending
on the energies in the current and neighboring states
and the current temperature T. The temperature is decreased
with each iteration of the Simulated Annealing
algorithm and we stop iterating when the temperature is 
zero. Allowing the solution to move to a higher energy 
state allows us to avoid local minima. 


1. State s: A set of mappings from destination hosts 
to core switches. Each host in a pod is assigned a 
particular core switch that it receives traffic from. 


2. Energy function E: The total exceeded capacity
over all the links in the current state. Every state
assigns a unique path to every flow. We use that
information to find the links for which the total capacity
is exceeded and sum up exceeded demands
over these links.


3. Temperature T: The remaining number of iterations
before termination.


4. Acceptance probability P for transition from state s
to neighbor state s_n, with energies E and E_n.


    P(E, E_n, T) = \begin{cases} 1 & \text{if } E_n < E \\ e^{c(E - E_n)/T} & \text{if } E_n \ge E \end{cases}

where c is a parameter that can be varied. We empirically
determined that c = 0.5 × T_0 gives best
results for a 16 host cluster and c = 1000 × T_0 is
best for larger data centers.


5. Neighbor generator function NEIGHBOR(): Swaps 
the assigned core switches for a pair of hosts in any 
of the pods in the current state s. 
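
The five components above can be combined into a short Python sketch. It is illustrative only: the exponential acceptance rule is the standard Metropolis-style form and is an assumption here, and the toy energy function models just one overloaded link per core switch rather than a full fat-tree. The names `simulated_annealing`, `swap_neighbor` and `core_overload` are hypothetical.

```python
import math
import random


def simulated_annealing(init_state, energy, neighbor, n, c, rng=random):
    """Search for a low-energy state; the temperature is the iterations left."""
    s = init_state
    e = energy(s)
    best, best_e = s, e
    for t in range(n, 0, -1):
        s_n = neighbor(s, rng)
        e_n = energy(s_n)
        if e_n < best_e:
            best, best_e = s_n, e_n
        # Assumed acceptance rule: always take improvements, take worse
        # states with probability exp(c * (e - e_n) / t).
        p = 1.0 if e_n < e else math.exp(c * (e - e_n) / t)
        if p > rng.random():
            s, e = s_n, e_n
    return best, best_e


def swap_neighbor(state, rng):
    """Swap the assigned core switches of two randomly chosen hosts."""
    i, j = rng.sample(range(len(state)), 2)
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def core_overload(demands, capacity):
    """Toy energy: total demand exceeding capacity on each core's downlink."""
    def energy(state):
        loads = {}
        for host, core in enumerate(state):
            loads[core] = loads.get(core, 0) + demands[host]
        return sum(max(0, load - capacity) for load in loads.values())
    return energy
```

A state here is a tuple whose entry at index h is the core switch assigned to destination host h, so swapping two entries preserves the number of hosts mapped to each core.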


While simulated annealing is a known technique, our 
contribution lies in an optimization to significantly re- 
duce the search space and the choice of appropriate en- 
ergy and neighbor selection functions to ensure rapid 
convergence to a near optimal schedule. A straightfor- 
ward approach is to assign a core for each flow individ- 
ually and perform simulated annealing. However this re- 
sults in a huge search space limiting the effectiveness of 
simulated annealing. The diameter of the search space 
(maximum number of neighbor hops between any two 
states) with this approach is equal to the number of flows 
in the system. Our technique of assigning core switches 
to destination hosts reduces the diameter of the search 
space to the minimum of the number of flows and the 
number of hosts in the data center. This heuristic reduces 
the search space significantly: in a 27k host data cen- 
ter with 27k large flows, the search space size is reduced 
by a factor of 10^{12,000}. Simulated Annealing performs
better when the size of the search space and its diameter 
are reduced [12]. With the straightforward approach, the 
runtime of the algorithm is proportional to the number of 
flows and the number of iterations while our technique’s 
runtime depends only on the number of iterations. 
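
The order of magnitude of this reduction can be checked with a short calculation. The counting model below is an assumption, since the text does not spell out the derivation: a 27,648-host fat-tree is taken to use k = 48 with (k/2)^2 = 576 core switches, the per-flow search picks any core for each flow, and the per-destination search is restricted to one-to-one host-to-core assignments within each of the k pods.

```python
import math


def log10_search_space_reduction(k, num_flows):
    """log10 of (per-flow search space) / (per-destination search space)."""
    cores = (k // 2) ** 2
    # Per-flow assignment: each flow independently picks one of the cores.
    per_flow = num_flows * math.log10(cores)
    # Per-destination assignment: cores! one-to-one mappings in each pod.
    per_host = k * math.lgamma(cores + 1) / math.log(10)
    return per_flow - per_host
```

Under these assumptions, `log10_search_space_reduction(48, 27648)` evaluates to roughly 1.2 × 10^4, consistent with a reduction factor on the order of 10^{12,000}.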

We implemented both the baseline and optimized ver- 
sion of Simulated Annealing. Our simulations show 
that for randomized communication patterns in an 8,192
host data center with 16k flows, our techniques deliver 
a 20% improvement in bisection bandwidth and a 10X 
reduction in computation time compared to the baseline. 
These gains increase both with the size of the data center 
as well as the number of flows. 


Initial state: Each pod has some fixed downlink capac- 
ity from the core switches which is useful only for traffic 
destined to that pod. So an important insight here is that 
we should distribute the core switches among the hosts 
in a single pod. For a fat-tree, the number of hosts in a 
pod is equal to the number of core switches, suggesting 
a one-to-one mapping. We restrict our solution search 
space to such assignments, i.e. we assign cores not to 
individual flows, but to destination hosts. Note that this 
choice of initial state is only used when the Simulated 
Annealing scheduler is run for the first time. We use an 
optimization to handle the dynamics of the system which 
reduces the importance of this initial state over time. 


ESTIMATE-DEMANDS()
    for all i, j
        M_{i,j} ← 0
    do
        foreach h ∈ H do EST-SRC(h)
        foreach h ∈ H do EST-DST(h)
    while some M_{i,j}.demand changed
    return M

Figure 7: Demand estimator for TCP flows. M is the demand
matrix and H is the set of hosts. d_F denotes “converged” demand,
n_U is the number of unconverged flows, e_S is the computed
equal share rate, and ⟨src → dst⟩ is the set of flows from
src to some dst. In EST-DST d_T is the total demand, d_S is
sender limited demand, f.rl is a flag for a receiver limited flow
and n_R is the number of receiver limited flows.


Neighbor generator: A well-crafted neighbor generator
function intrinsically avoids deep local minima. Consistent
with restricting the solution search space to
near-uniform mappings of the hosts in a pod to core
switches, our implementation employs three different
neighbor generator functions: (1) swap the assigned
core switches for any two randomly chosen hosts in a
randomly chosen pod, (2) swap the assigned core
switches for any two randomly chosen hosts attached to a
randomly chosen edge switch, (3) randomly choose an
edge or aggregation switch with equal probability and
swap the assigned core switches for a random pair of
hosts that use the chosen edge or aggregation switch to
reach their currently assigned core switches. In each
iteration, our neighbor generator chooses among these 3
techniques at runtime with equal probability. Using
multiple neighbor generator functions helps us avoid
deep local minima in the search spaces of individual
neighbor generator functions.


Calculation of energy function: The energy function 
for a neighbor can be calculated incrementally based on 
the energy in the current state and the cores that were 
swapped in the neighbor. We need not recalculate ex- 
ceeded capacities for all links. Swapping assigned cores 
for a pair of hosts only affects those flows destined to 
those two hosts. So we need to recalculate the difference 
in the energy function only for those specific links in- 
volved and update the value of the energy based on the 
energy in the current state. Thus, the time to calculate 
the energy only depends on the number of large flows 
destined to the two affected hosts. 
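
The incremental update can be sketched as follows. This is a minimal sketch assuming link loads are kept in a dictionary and that the caller supplies, for each affected flow, its rate and its old and new links; the function names and the link labels used in the example are hypothetical.

```python
def excess(load, capacity):
    """Amount by which a link's load exceeds its capacity."""
    return max(0, load - capacity)


def total_energy(loads, capacity):
    """Full recomputation over all links, for comparison."""
    return sum(excess(v, capacity) for v in loads.values())


def apply_reroute(loads, energy, capacity, moves):
    """Update loads in place and return the new energy.

    moves: list of (rate, old_links, new_links), one per affected flow.
    Only links touched by the moves are re-examined.
    """
    touched = set()
    for _, old, new in moves:
        touched.update(old, new)
    energy -= sum(excess(loads.get(l, 0), capacity) for l in touched)
    for rate, old, new in moves:
        for l in old:
            loads[l] -= rate
        for l in new:
            loads[l] = loads.get(l, 0) + rate
    energy += sum(excess(loads.get(l, 0), capacity) for l in touched)
    return energy
```

The cost of `apply_reroute` depends only on the flows listed in `moves`, which for a core swap are the large flows destined to the two affected hosts.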


Dynamically changing flows: With dynamically chang- 
ing flow patterns, in every scheduling phase, a few flows 
would be newly classified as large flows and a few older 
ones would have completed their transfers. We have im- 
plemented an optimization where we set the initial state 
to the best state from the previous scheduling phase. This 
allows the route-placement of existing, continuing flows 
to be disrupted as little as possible if their current paths 
can still support their bandwidth requirements. Further, 
the initial state that is used when the Simulated Anneal- 
ing scheduler first starts up becomes less relevant over 
time due to this optimization. 


Search space: The key characteristic of Simulated An- 
nealing is assigning unique core switches based on des- 
tination hosts in a pod, crucial to reducing the size 
of the search space. However, there are communica- 
tion patterns where an optimal solution necessarily re- 
quires a single destination host to receive incoming traf- 
fic through multiple core switches. While we omit the 
details for brevity, we find that, at least for the fat tree 
topology, all communication patterns can be handled if: 
i) the maximum number of large flows to or from a host
is at most k/2, where k is the number of ports in the
network switches, or ii) the minimum threshold of each
large flow is set to 2/k of the link capacity. Given that in 
practice data centers are likely to be built from relatively 
high-radix switches, e.g., k > 32, our search space opti- 
mization is unlikely to eliminate the potential for locating 
optimal flow assignments in practice. 


Global First-Fit
Simulated Annealing


Table 1: Time and Space Complexity of Global First Fit and
Simulated Annealing. k is the number of switch ports, |F| is
the total number of large flows, and f_avg is the average number
of large flows to a host. The k^3 factor is due to in-memory
link-state structures, and the |F| factor is due to the flows’ state.


4.5 Comparison of Placement Algorithms 


With Global First Fit, a large flow can be re-routed im- 
mediately upon detection and is essentially pinned to its 
reserved links. In contrast, Simulated Annealing waits for
the next scheduling tick, uses previously computed flow
placements to optimize the current placement, and delivers
even better network utilization on average due to its
probabilistic search.

We chose the Global First Fit and Simulated Anneal- 
ing algorithms for their simplicity; we take the view that 
more complex algorithms can hinder the scalability and 
efficiency of the scheduler while gaining only incremen- 
tal bandwidth returns. We believe that they strike the 
right balance of computational complexity and delivered 
performance gains. Table 1 gives the time and space
complexities of both algorithms. Note that the time complexity
of Global First Fit is independent of |F|, the number
of large flows in the network, and that the time complexity
of Simulated Annealing is independent of k.

More to the point, the simplicity of our algorithms 
makes them both well-suited for implementation in hard- 
ware, such as in an FPGA, as they consist mainly of sim- 
ple arithmetic. Such an implementation would substan- 
tially reduce the communication overhead of crossing the 
network stack of a standalone scheduler machine. 

Overall, while Simulated Annealing is more concep- 
tually involved, we show in Sec. 6 that it almost always 
outperforms Global First Fit, and delivers close to the 
optimal bisection bandwidth both for our testbed and in 
larger simulations. We believe the additional conceptual 
complexity of Simulated Annealing is justified by the 
bandwidth gains and tremendous investment in the net- 
work infrastructure of modern data centers. 





4.6 Fault Tolerance 


Any scheduler must account for switch and link failures 
in performing flow assignments. While we omit the de- 
tails for brevity, our Hedera implementation augments 
the PortLand routing and fault tolerance protocols [29]. 
Hence, the Hedera scheduler is aware of failures us- 
ing the standard PortLand mechanisms and can re-route 
flows mapped to failed components. 


5 Implementation


To test our scheduling techniques on a real physical 
multi-rooted network, we built as an example the fat- 
tree network described abstractly in prior work [3]. In 
addition, to understand how our algorithms scale with 
network size, we implemented a simulator to model the 
behavior of large networks with many flows under the 
control of a scheduling algorithm. 


5.1 Topology 


For the rest of the paper, we adopt the following termi- 
nology: for a fat-tree network built from k-port switches, 
there are k pods, each consisting of two layers: lower pod 
switches (edge switches), and the upper pod switches 
(aggregation switches). Each edge switch manages 
(k/2) hosts. The k pods are interconnected by (k/2)^2
core switches.

One of the main advantages of this topology is the high
degree of available path diversity; between any given
source and destination host pair, there are (k/2)^2 equal-cost
paths, each corresponding to a core switch. Note,
however, that these paths are not link-disjoint. To take 
advantage of this path diversity (to maximize the achiev- 
able bisection bandwidth), we must assign flows non- 
conflicting paths. A key requirement of our work is to 
perform such scheduling with no modifications to end- 
host network stacks or operating systems. Our testbed 
consists of 16 hosts interconnected using a fat-tree of 
twenty 4-port switches, as shown in Figure 8. 
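
The sizes implied by this terminology can be tabulated with a small helper. This is a sketch: the function name is hypothetical, and the count of k/2 edge and k/2 aggregation switches per pod is the standard fat-tree construction, assumed here rather than stated in this section.

```python
def fat_tree_dimensions(k):
    """Element counts for a fat-tree built from k-port switches."""
    if k % 2:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k/2 per pod (assumed)
        "aggregation_switches": k * half,   # k/2 per pod (assumed)
        "core_switches": half ** 2,
        "hosts": k * half * half,           # k/2 hosts per edge switch
        # One path per core switch, for hosts in different pods.
        "equal_cost_paths": half ** 2,
    }
```

For k = 4 this gives 16 hosts and 8 + 8 + 4 = 20 switches, matching the testbed; k = 16 and k = 32 give 1,024 and 8,192 hosts respectively.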

We deploy a parallel control plane connecting all 
switches to a 48-port non-blocking GigE switch. We em- 
phasize that this control network is not required for the 
Hedera architecture, but is used in our testbed as a de- 
bugging and comparison tool. This network transports 
only traffic monitoring and management messages to and 
from the switches; however, these messages could also 
be transmitted using the data plane. Naturally, for larger 
networks of thousands of hosts, a control network could 
be organized as a traditional tree, since control traffic 
should be only a small fraction of the data traffic. In 
our deployment, the flow scheduler runs on a separate 
machine connected to the 48-port switch. 


5.2 Hardware Description 


The switches in the testbed are 1U dual-core 3.2 GHz 
Intel Xeon machines, with 3GB RAM, and NetFPGA 4- 
port GigE PCI card switches [26]. The 16 hosts are 1U 
quad-core 2.13 GHz Intel Xeon machines with 3GB of 
RAM. These hosts have two GigE ports, the first con- 
nected to the control network for testing and debugging, 
and the other to its NetFPGA edge switch. The control
network is organized as a simple star topology. The central
switch is a Quanta LB4G 48-port GigE switch. The
scheduler machine has a dual-core 2.4 GHz Intel Pentium
CPU and 2GB of RAM.


Figure 8: System Architecture. The interconnect shows the
data-plane network, with GigE links throughout.


5.3 OpenFlow Control


The switches in the tree all run OpenFlow [27], which 
allows access to the forwarding tables for all switches. 
OpenFlow implementations have been ported to a va- 
riety of commercial switches, including those from Ju- 
niper, HP, and Cisco. OpenFlow switches match incom- 
ing packets to flow entries that specify a particular action 
such as duplication, forwarding on a specific port, drop- 
ping, and broadcast. The NetFPGA OpenFlow switches 
have 2 hardware tables: a 32-entry TCAM (that accepts 
variable-length prefixes) and a 32K entry SRAM that 
only accepts flow entries with fully qualified 10-tuples. 

When OpenFlow switches start, they attempt to open a 
secure channel to a central controller. The controller can
query, insert, and modify flow entries, or perform a host of
other actions. The switches maintain statistics per flow
and per port, such as total byte counts, and flow dura- 
tions. The default behavior of the switch is as follows: if 
an incoming packet does not match any of the flow en- 
tries in the TCAM or SRAM table, the switch inserts a 
new flow entry with the appropriate output port (based 
on ECMP) which allows any subsequent packets to be 
directly forwarded at line rate in hardware. Once a flow 
grows beyond the specified threshold, the Hedera sched- 
uler may modify the flow entry for that flow to redirect it 
along a newly chosen path. 


5.4 Scheduling Frequency 


Our scheduler implementation polls the edge switches 
for flow statistics (to detect large flows), and performs 
demand estimation and scheduling once every five sec- 
onds. This period is due entirely to a register read- 
rate limitation of the OpenFlow NetFPGA implementa- 
tion. However, our scalability measurements in Section 6 
show that a modestly-provisioned machine can schedule
tens of thousands of flows in a few milliseconds, and that
even at the 5 second polling rate, Hedera significantly 
outperforms the bisection bandwidth of current ECMP 
methods. In general, we believe that sub-second and po- 
tentially sub-100ms scheduling intervals should be pos- 
sible using straightforward techniques. 


5.5 Simulator 


Since our physical testbed is restricted to 16 hosts, we 
also developed a simulator that coarsely models the 
behavior of a network of TCP flows. The simulator 
accounts for flow arrivals and departures to show the 
scalability of our system for larger networks with dy- 
namic communication patterns. We examine our differ- 
ent scheduling algorithms using the flow simulator for 
networks with as many as 8,192 hosts. Existing packet- 
level simulators, such as ns-2, are not suitable for this 
purpose: e.g. a simulation with 8,192 hosts each sending 
at 1Gbps would have to process 2.5 × 10^11 packets for
a 60 second run. If a per-packet simulator were used to
model the transmission of 1 million packets per second
using TCP, it would take 71 hours to simulate just that 
one test case. 


Our simulator models the data center topology as a 
network graph with directed edges. Each edge has a fixed 
capacity. The simulator accepts as input a communica- 
tion pattern among hosts and uses it, along with a speci- 
fication of average flow sizes and arrival rates, to gener- 
ate simulated traffic. The simulator generates new flows 
with an exponentially distributed length, with start times 
based on a Poisson arrival process with a given mean. 
Destinations are based upon the suite in Section 6. 
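
A flow generator of this kind can be sketched in Python. This is an illustrative sketch, not the simulator's code: it assumes an independent Poisson arrival process per source and destination pair, and the function and parameter names are hypothetical.

```python
import random


def generate_flows(pattern, mean_interarrival, mean_size, horizon, rng):
    """Return a time-sorted list of (start_time, src, dst, size) tuples.

    pattern: iterable of (src, dst) pairs that communicate.
    Inter-arrival times and flow sizes are exponentially distributed.
    """
    flows = []
    for src, dst in pattern:
        t = rng.expovariate(1.0 / mean_interarrival)
        while t < horizon:
            size = rng.expovariate(1.0 / mean_size)
            flows.append((t, src, dst, size))
            t += rng.expovariate(1.0 / mean_interarrival)
    flows.sort()
    return flows
```

Passing an explicit `random.Random` instance as `rng` makes a simulated workload repeatable across schedulers, so that placement algorithms can be compared on identical traffic.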


The simulation proceeds in discrete time ticks. At 
each tick, the simulation updates the rates of all flows in
the network and generates new flows if needed. Periodically
it also calls the scheduler to assign (new) routes to flows. 
When calling the Simulated Annealing and Global First 
Fit schedulers, the simulator first calls the demand esti- 
mator and passes along its results. 


When updating flow rates, the simulator models TCP 
slow start and AIMD, but without performing per-packet 
computations. Each tick, the simulator shuffles the order 
of flows and computes the expected rate increase for each 
flow, constrained by available bandwidth on the flow’s 
path. If a flow is in slow start, its rate is doubled. If it is 
in congestion avoidance, its rate is additively increased 
(using an additive increase factor of 15 MB/s to simulate 
a network with an RTT of 100 µs). If the flow’s path
is saturated, the flow’s rate is halved and bandwidth is 
freed along the path. Each tick, we also compute the 
number of bytes sent by the flow and purge flows that 
have completed sending all their bytes. 
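
The per-tick rate update can be sketched as follows. This is a simplified sketch under stated assumptions: rates are in MB/s, the initial rate of 1 MB/s is arbitrary, and a flow is assumed to leave slow start the first time its path saturates. The names `SimFlow` and `tick` are hypothetical.

```python
import random
from dataclasses import dataclass

ADDITIVE_INCREASE = 15.0    # MB/s per tick, the factor given in the text


@dataclass
class SimFlow:
    links: tuple
    rate: float = 1.0       # MB/s; the starting rate is an assumption
    slow_start: bool = True


def tick(flows, used, capacity, rng=random):
    """Advance all flow rates by one tick. used maps link -> MB/s in use."""
    order = list(flows)
    rng.shuffle(order)                       # avoid favoring any flow
    for f in order:
        headroom = min(capacity - used[l] for l in f.links)
        if headroom <= 0:
            # Path saturated: halve the rate and free the bandwidth.
            delta = -f.rate / 2
            f.slow_start = False             # assumed, as in standard TCP
        else:
            # Double in slow start, otherwise increase additively,
            # limited by the bandwidth available on the path.
            want = f.rate if f.slow_start else ADDITIVE_INCREASE
            delta = min(want, headroom)
        f.rate += delta
        for l in f.links:
            used[l] += delta
```

A single flow on an otherwise idle 125 MB/s (1 Gbps) link doubles from 1 MB/s until it fills the link after seven ticks, is halved to 62.5 MB/s on the eighth, and then grows by 15 MB/s per tick.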


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 


Since our simulator does not model individual pack- 
ets, it does not capture the variations in performance of 
different packet sizes. Another consequence of this deci- 
sion is that our simulation cannot capture inter-flow dy- 
namics or buffer behavior. As a result, it is likely that 
TCP Reno/New Reno would perform somewhat worse 
than predicted by our simulator. In addition, we model 
TCP flows as unidirectional although real TCP flows in- 
volve ACKs in the reverse direction; however, for 1500 
byte Ethernet frames and delayed ACKs, the bandwidth 
consumed by ACKs is about 2%. We feel these trade-offs 
are necessary to study networks of the scale described in 
this paper. 

We ran each simulation for the equivalent of 60 seconds
and measured the average bisection bandwidth during
the middle 40 seconds. Because the simulator does not
capture inter-flow dynamics and traffic burstiness, our
results for ECMP-based flow placement are optimistic
(simulator bandwidth exceeds testbed measurements):
on the testbed, hash collisions would sometimes cause an
entire window of data to be lost, resulting in a
coarse-grained timeout (see Section 6). For the
control network we observed that the performance in 
the simulator more closely matched the performance on 
the testbed. Similarly, for Global First Fit and Simu- 
lated Annealing, which try to optimize for minimum 
contention, we observed that the performance from the 
simulator and testbed matched very well. Across all the 
results, the simulator indicated better performance than 
the testbed when there is contention between flows. 


6 Evaluation 


This section describes our evaluation of Hedera using our 
testbed and simulator. The goal of these tests is to deter- 
mine the aggregate achieved bisection bandwidth with 
various traffic patterns. 


6.1 Benchmark Communication Suite 


In the absence of commercial data center network traces, 
for both the testbed and the simulator evaluation, we first 
create a group of communication patterns similar to [3] 
according to the following styles: 

(1) Stride(i): A host with index x sends to the host
with index (x + i) mod (num_hosts).

(2) Staggered Prob (EdgeP, PodP): A host sends to
another host in the same edge switch with probability
EdgeP, to another host in its pod with probability PodP, and to
the rest of the network with probability 1 - EdgeP - PodP.

(3) Random: A host sends to any other host in the 
network with uniform probability. We include bijective 
mappings and ones where hotspots are present. 
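
The three styles can be sketched as destination generators. This sketch makes assumptions beyond the text: hosts are indexed consecutively edge switch by edge switch and pod by pod, the PodP case is read as "same pod, different edge switch", and the function names are hypothetical.

```python
import random


def stride(i, num_hosts):
    """Host x sends to host (x + i) mod num_hosts."""
    return {x: (x + i) % num_hosts for x in range(num_hosts)}


def staggered_prob(edge_p, pod_p, k, rng):
    """Pick a destination per host by locality class in a k-port fat-tree."""
    per_edge, per_pod = k // 2, (k // 2) ** 2
    num_hosts = k * per_pod
    dests = {}
    for x in range(num_hosts):
        edge_start = x - x % per_edge
        pod_start = x - x % per_pod
        edge = range(edge_start, edge_start + per_edge)
        pod = range(pod_start, pod_start + per_pod)
        r = rng.random()
        if r < edge_p:
            pool = [h for h in edge if h != x]
        elif r < edge_p + pod_p:
            pool = [h for h in pod if h not in edge]
        else:
            pool = [h for h in range(num_hosts) if h not in pod]
        dests[x] = rng.choice(pool)
    return dests


def random_bijective(num_hosts, rng):
    """Random permutation in which no host sends to itself."""
    while True:
        perm = list(range(num_hosts))
        rng.shuffle(perm)
        if all(x != d for x, d in enumerate(perm)):
            return dict(enumerate(perm))
```

A non-bijective random pattern, where hotspots can form, would instead draw each destination independently, for example with `rng.choice` over all other hosts.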


Figure 9: Physical testbed benchmark suite results for the three routing methods vs. a non-blocking switch. Figures indicate
network bisection bandwidth achieved for staggered, stride, and randomized communication patterns.


We consider these mappings for networks of different 
sizes: 16 hosts, 1,024 hosts, and 8,192 hosts, corresponding
to k = {4, 16, 32}.


6.2 Testbed Benchmark Results 


We ran benchmark tests as follows: 16 hosts open socket 
sinks for incoming traffic and measure the incoming 
bandwidth constantly. The hosts in succession then start 
their flows according to the sizes and destinations as de- 
scribed above. Each experiment lasts for 60 seconds and 
uses TCP flows; we observed the average bisection band- 
width for the middle 40 seconds. 

We compare the performance of the scheduler on the 
fat-tree network to that of the same experiments on 
the control network. The control network connects all 
16 hosts using a non-blocking 48-port gigabit Ethernet 
switch and represents an ideal network. In addition, we 
include a static hash-based ECMP scheme, where the 
forwarding path is determined by a hash of the destina- 
tion host IP address. 

Figure 9 shows the bisection bandwidth for a variety 
of randomized, staggered, stride and hotspot communi- 
cation patterns; our experiments saturate the links us- 
ing TCP. In virtually all the communication patterns ex- 
plored, Global First Fit and Simulated Annealing signif- 
icantly outperform static hashing (ECMP), and achieve 
near the optimal bisection bandwidth of the network 
(15.4Gb/s goodput). Naturally, the performance of these 
schemes improves as the level of communication locality
increases, as demonstrated by the staggered probability
figures. Note that for stride patterns (common
to HPC computation applications), the heuristics con- 
sistently compute the correct flow-to-core mappings to 
efficiently utilize the fat-tree network, whereas the per- 
formance of static hash quickly deteriorates as the stride 
length increases. Furthermore, for certain patterns, these 
heuristics also marginally outperform the commercial 
48-port switch used for our control network. We sus- 
pect this is due to different buffers/algorithms of the Net- 
FPGAs vs. the Quanta switch. 

Upon closer examination of the performance using 
packet captures from the testbed, we found that when 
there was contention between flows, an entire TCP win- 
dow of packets was often lost. So the TCP connection 
was idle until the retransmission timer fired (RTO_min =
200ms). ECMP hash based flow placement experienced
over 5 times the number of retransmission timeouts as 
the other schemes. This explains the overoptimistic per- 
formance of ECMP in the simulator as explained in Sec- 
tion 5 since our simulator does not model retransmission 
timeouts and individual packet losses. 


6.3 Data Shuffle 


We also performed an all-to-all in-memory data shuffle 
in our testbed. A data shuffle is an expensive but neces- 
sary operation for many MapReduce/Hadoop operations 
in which every host transfers a large amount of data to 
every other host participating in the shuffle. In this experiment,
each host sequentially transfers 500MB to every
other host using TCP (a 120GB shuffle).
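
The total shuffle size follows directly from the 16-host testbed, with each host sending to its 15 peers:

```latex
16 \text{ hosts} \times 15 \text{ peers} \times 500\,\text{MB}
  = 120{,}000\,\text{MB} = 120\,\text{GB}
```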


Table 2: A 120GB shuffle for the placement heuristics in our
testbed. Shown is total shuffle time, average host-completion
time, average bisection bandwidth and average host goodput.





[Plot: network bisection bandwidth (0 to 800) vs. time (0 to 60 seconds); series: Non-blocking, Simulated Annealing, Global First-Fit]

Figure 11: Network bisection bandwidth vs. time for a 1,024
host fat-tree and a random bijective traffic pattern.


The shuffle results in Table 2 show that centralized 
flow scheduling performs considerably better (39% bet- 
ter bisection bandwidth) than static ECMP hash-based 
routing. Comparing this to the data shuffle performed in 
VL2 [16], which involved all hosts making simultaneous 
transfers to all other hosts (versus the sequential transfers 
in our work), we see that static hashing performs better 
when the number of flows is significantly larger than the 
number of paths; intuitively a hash collision is less likely 
to introduce significant degradation when any imbalance 
is averaged over a large number of flows. For this reason, 
in addition to the delay of the Hedera observation/route- 
computation control loop, we believe that traffic work- 
loads characterized by many small, short RPC-like flows 
would have limited benefit from dynamic scheduling, 
and Hedera’s default ECMP forwarding performs load- 
balancing efficiently in this case. Hence, by threshold- 
ing our scheduler to only operate on larger flows, Hedera 
performs well for both types of communication patterns. 


6.4 Simulation Results 
6.4.1 Communication Patterns 


Table 3: Percentage of final bisection bandwidth by varying
the Simulated Annealing iterations, for a case of random
destinations, normalized to the full network bisection. Also
shown is the same load running on a non-blocking topology.


In Figure 10 we show the aggregate bisection bandwidth
achieved when running the benchmark suite for a simulated
fat-tree network with 8,192 hosts (when k=32).
We compare our algorithms against a hypothetical non-blocking
switch for the entire data center and against
static ECMP hashing. The performance of ECMP worsens
as the probability of local communication decreases.
This is because even for a completely fair and perfectly 
uniform hash function, collisions in path assignments 
do happen, either within the same switch or with flows 
at a downstream switch, wasting a portion of the avail- 
able bandwidth. A global scheduler makes discrete flow 
placements that are chosen by design to reduce overlap. 
In most of these different communication patterns, our 
dynamic placement algorithms significantly outperform 
static ECMP hashing. Figure 11 shows the variation over 
time of the bisection bandwidth for the 1,024 host fat-tree 
network. Global First Fit and Simulated Annealing per- 
form fairly close to optimal for most of the experiment. 


6.4.2 Quality of Simulated Annealing 


To explore the parameter space of Simulated Annealing, 
we show in Table 3 the effect of varying the number of it- 
erations at each scheduling period for a randomized, non- 
bijective communication pattern. This table confirms our 
initial intuition regarding the assignment quality vs. the 
number of iterations, as most of the improvement takes 
place in the first few iterations. We observed that the 
performance of Simulated Annealing asymptotically ap- 
proaches the best result found by Simulated Annealing 
after the first few iterations. 


The table also shows the percentage of final bisection 
bandwidth for a random communication pattern as the number
of hosts and flows increases. This supports our belief
that Simulated Annealing can be run with relatively
few iterations in each scheduling period and still achieve 
comparable performance over time. This is aided by re- 
membering core assignments across periods, and by the 
arrival of only a few new large flows each interval. 


Figure 10: Comparison of scheduling algorithms for different traffic patterns on a fat-tree topology of 8,192 hosts.


Table 4: Demand estimation runtime.


6.4.3 Complexity of Demand Estimation 


Since the demand estimation is performed once per 
scheduling period, its runtime must be reasonably small 
so that the length of the control loop is as small as pos- 
sible. We studied the runtime of demand estimation for 
different traffic matrices in data centers of varying sizes. 


Table 4 shows the runtimes of the demand estimator 
for different input sizes. The reported runtimes are for 
runs of the demand estimator using 4 parallel threads 
of execution on a modest quad-core 2.13 GHz machine. 
Even for a large data center with 27,648 hosts and 
250,000 large flows (average of nearly 10 large flows per 
host), the runtime of the demand estimation algorithm is 
only 200 ms. For more common scenarios, the runtime 
is approximately 50-100ms in our setup. We expect the 
scheduler machine to be a fairly high performance ma- 
chine with more cores, thereby still keeping the runtime 
well under 100ms even for extreme scenarios. 

The memory requirement for the demand estimator 
in our implementation using a sparse matrix representa- 
tion is less than 20 MB even for the extreme scenario 
with nearly 250,000 large flows in a data center with 
27k hosts. In more common scenarios, with a reasonable 
number of large flows in the data center, the entire data 
structure would fit in the L2 cache of a modern CPU. 

Considering the simplicity and number of operations
involved, an FPGA implementation can store the sparse
matrix in an off-chip SRAM. An FPGA such as the Xilinx
Virtex-5 can implement up to 200 parallel processing
cores to process this matrix. We estimate that such a
configuration would have a computational latency of
approximately 5 ms to perform demand estimation even for
the case of 250,000 large flows.

Table 5: Runtime (ms) vs. number of Simulated Annealing
iterations for different number of flows f.


6.4.4 Complexity of Simulated Annealing 


In Table 5 we show the runtime of Simulated Anneal- 
ing for different experimental scenarios. The runtime of 
Simulated Annealing is asymptotically independent of 
the number of hosts and only dependent on the number 
of flows. The main takeaway here is the scalability of 
our Simulated Annealing implementation and its poten- 
tial for practical application; for networks of thousands 
of hosts and a reasonable number of flows per host, the 
Simulated Annealing runtime is on the order of tens of 
milliseconds, even for 10,000 iterations. 


6.4.5 Control Overhead 


To evaluate the total control overhead of the centralized 
scheduling design, we analyzed the overall communica- 
tion and computation requirements for scheduling. The 
control loop includes 3 components—all switches in the 
network send the details of large flows to the scheduler, 
the scheduler estimates demands of the flows and com- 
putes their routes, and the scheduler transmits the new 
placement of flows to the switches. 

We made some assumptions to analyze the length of
the control loop. (1) The control plane is made up of
48-port GigE switches with an average 10 μs latency per
switch. (2) The format of messages between the switches
and the controller is based on the OpenFlow protocol
(72B per flow entry) [27]. (3) The total computation
time for demand estimation and scheduling of the flows
is conservatively assumed to be 100 ms. (4) The last hop
link to the scheduler is assumed to be a 10 GigE link.
This higher speed last hop link allows a large number of
switches to communicate with the scheduler simultaneously.
We assumed that the 10 GigE link to the controller
can be fully utilized for transfer of scheduling updates.

Hosts     Large flows per host (increasing left to right; rightmost column: 10)
1,024     100.2 | 100.9 | 101.7
8,192     101.4 | 106.8 | 113.5
27,648    104.6 | 122.8 | 145.5

Table 6: Length of control loop (ms).

Table 6 shows the length of the control loop for vary- 
ing number of large flows per host. The values indi- 
cate that the length of the control loop is dominated by 
the computation time, estimated at 100 ms. These re- 
sults show the scalability of the centralized scheduling 
approach for large data centers. 
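
The arithmetic behind these assumptions can be sketched as a rough lower bound. The sketch below is illustrative only: it assumes one 72-byte entry per large flow in each direction over the 10 GigE link, and it ignores per-switch latency and any protocol framing, so it underestimates the values in Table 6. The function name `control_loop_ms` is hypothetical.

```python
# Payload-only lower bound on the control-loop length (assumptions in comments).
FLOW_ENTRY_BYTES = 72    # OpenFlow flow entry size, assumption (2)
LINK_BPS = 10e9          # 10 GigE last-hop link, assumption (4)
COMPUTE_MS = 100.0       # demand estimation + scheduling, assumption (3)


def control_loop_ms(hosts: int, flows_per_host: int) -> float:
    """Estimate loop length: report flows, compute, push new placements."""
    payload_bits = hosts * flows_per_host * FLOW_ENTRY_BYTES * 8
    one_way_ms = payload_bits / LINK_BPS * 1000.0
    # Assumed symmetric traffic: flow details go to the scheduler and
    # the new placements come back over the same link.
    return COMPUTE_MS + 2 * one_way_ms


for hosts in (1_024, 8_192, 27_648):
    print(hosts, round(control_loop_ms(hosts, 10), 1))
```

Even in this simplified model the transfer time stays small next to the 100 ms computation term, which matches the observation that computation dominates the loop.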


7 Related Work 


There has been a recent flood of new research pro- 
posals for data center networks; however, none satis- 
fyingly addresses the issue of the network’s bisection 
bandwidth. VL2 [16] and Monsoon [17] propose us- 
ing per-flow Valiant Load Balancing, which can cause 
bandwidth losses due to long-term collisions as demon- 
strated in this work. SEATTLE [25] proposes a single 
Layer 2 domain with a one-hop switch DHT for MAC 
address resolution, but does not address multipathing. 
DCell [19] and BCube [18] suggest using recursively- 
defined topologies for data center networks, which in- 
volves multi-NIC servers and can lead to oversubscribed 
links with deeper levels. Once again, multipathing is not 
explicitly addressed. 

Researchers have also explored scheduling flows in 
a multi-path environment from a wide-area context. 
TeXCP [23] and MATE [10] perform dynamic traffic en- 
gineering across multiple paths in the wide-area by using 
explicit congestion notification packets, which require as 
yet unavailable switch support. They employ distributed 
traffic engineering, whereas we leverage the data center 
environment using a tightly-coupled central scheduler. 
FLARE [31] proposes multipath forwarding in the wide- 
area on the granularity of flowlets (TCP packet bursts); 
however, it is unclear whether the low intra-data center 
latencies meet the timing requirements of flowlet bursts 
to prevent packet reordering and still achieve good performance.
Miura et al. exploit fat-tree networks by multipathing
using tagged VLANs and commodity PCs [28].
Centralized router control to enforce routing or access 
control policy has been proposed before by the 4D archi- 
tecture [15], and projects like Tesseract [35], Ethane [6], 
and RCP [5], similar in spirit to Hedera’s approach to 
centralized flow scheduling. 

Much work has focused on virtual switching fab- 
rics and on individual Clos networks in the abstract, 
but does not address building an operational multi-level
switch architecture using existing commodity compo- 
nents. Turner proposed an optimal non-blocking virtual 
circuit switch [33], and Smiljanic improved Turner’s load 
balancer and focused on the guarantees the algorithm 
could provide in a generalized Clos packet-switched net- 
work [32]. Oki et al. design improved algorithms for 
scheduling in individual 3-stage Clos switches [30], and 
Holmburg provides models for simultaneous and incre- 
mental scheduling of multi-stage Clos networks [20]. 
Geoffray and Hoefler describe a number of strategies 
to increase bisection bandwidth in multistage intercon- 
nection networks, specifically focusing on source-routed, 
per-packet dispersive approaches that break the ordering 
requirement of TCP/IP over Ethernet [13]. 


8 Conclusions


The most important finding of our work is that in the pur- 
suit of efficient use of available network resources, a cen- 
tral scheduler with global knowledge of active flows can 
significantly outperform the state-of-the-art hash-based 
ECMP load-balancing. We limit the overhead of our ap- 
proach by focusing our scheduling decisions on the large 
flows likely responsible for much of the bytes sent across 
the network. We find that Hedera’s performance gains 
are dependent on the rates and durations of the flows 
in the network; the benefits are more evident when the 
network is stressed with many large data transfers both 
within pods and across the diameter of the network. 

In this paper, we have demonstrated the feasibility of 
building a working prototype of our scheduling system 
for multi-rooted trees, and have shown that Simulated 
Annealing almost always outperforms Global First Fit 
and is capable of delivering near-optimal bisection band- 
width for a wide range of communication patterns both 
in our physical testbed and in simulated data center net- 
works consisting of thousands of nodes. Given the low 
computational and latency overheads of our flow place- 
ment algorithms, the large investment in network infras- 
tructure associated with data centers (many millions of 
dollars), and the incremental cost of Hedera’s deploy- 
ment (e.g., one or two servers), we show that dynamic 
flow scheduling has the potential to deliver substantial 
bandwidth gains with moderate additional cost. 
Abstract


We present Airavat, a MapReduce-based system which 
provides strong security and privacy guarantees for dis- 
tributed computations on sensitive data. Airavat is a 
novel integration of mandatory access control and differ- 
ential privacy. Data providers control the security policy 
for their sensitive data, including a mathematical bound 
on potential privacy violations. Users without security 
expertise can perform computations on the data, but Aira- 
vat confines these computations, preventing information 
leakage beyond the data provider’s policy. 

Our prototype implementation demonstrates the flexi- 
bility of Airavat on several case studies. The prototype is 
efficient, with run times on Amazon’s cloud computing 
infrastructure within 32% of a MapReduce system with 
no security. 


1 Introduction 


Cloud computing involves large-scale, distributed com- 
putations on data from multiple sources. The promise of 
cloud computing is based in part on its envisioned ubiq- 
uity: Internet users will contribute their individual data 
and obtain useful services from the cloud. For example, 
targeted advertisements can be created by mining a user’s 
clickstream, while health-care applications of the future 
may use an individual’s DNA sequence to tailor drugs 
and personalized medical treatments. Cloud computing 
will fulfill this vision only if it supports flexible compu- 
tations while guaranteeing security and privacy for the in- 
put data. To balance the competing goals of a permissive 
programming model and the need to prevent information 
leaks, the untrusted code should be confined [30]. 
Contributors of data to cloud-based computations face 
several threats to their privacy. For example, consider a 
medical patient who is deciding whether to participate in 
a large health-care study. First, she may be concerned 
that a careless or malicious application operating on her 
data as part of the study may expose it—for instance, by 
writing it into a world-readable file which will then be 
indexed by a search engine. Second, she may be con- 
cerned that even if all computations are done correctly 
and securely, the result itself, e.g., aggregate health-care 
statistics computed as part of the study, may leak sensitive
information about her personal medical record.
Traditional approaches to data privacy are based on 
syntactic anonymization, i.e., removal of “personally
identifiable information” such as names, addresses, and
Social Security numbers. Unfortunately, anonymiza- 
tion does not provide meaningful privacy guarantees. 
High-visibility privacy fiascoes recently resulted from 
public releases of anonymized individual data, includ- 
ing AOL search logs [22] and the movie-rating records 
of Netflix subscribers [41]. The datasets in question 
were released to support legitimate data-mining and 
collaborative-filtering research, but naive anonymization 
was easy to reverse in many cases. These events motivate 
a new approach to protecting data privacy. 

One of the challenges of bringing security to cloud 
computing is that users and developers want to spend as 
little mental effort and system resources on security as 
possible. Completely novel APIs, even if secure, are un- 
likely to gain wide acceptance. Therefore, a key research 
question is how to design a practical system that (1) en- 
ables efficient distributed computations, (2) supports a 
familiar programming model, and (3) provides precise, 
rigorous privacy and security guarantees to data owners, 
even when the code performing the computation is un- 
trusted. In this paper, we aim to answer this question. 

Mandatory access control (MAC) is a useful building 
block for securing distributed computations. MAC-based 
operating systems, both traditional [26, 33, 37] and recent 
variants based on decentralized information flow con- 
trol [45, 49, 51], enforce a single access control policy 
for the entire system. This policy, which cannot be over- 
ridden by users, prevents information leakage via storage 
channels such as files, sockets, and program names. 

Access control alone does not achieve end-to-end pri- 
vacy in cloud computing environments, where the input 
data may originate from multiple sources. The output 
of the computation may leak sensitive information about 
the inputs. Since the output generally depends on all in- 
put sources, mandatory access control requires that only 
someone who has access rights to all inputs should have 
access rights to the output; enforcing this requirement 
would render the output unusable for most purposes. To 
be useful, the output of an aggregate computation must 
be “declassified,” but only when it is safe to do so, i.e., 
when it does not reveal too much information about any 
single input. Existing access control mechanisms simply 
delegate this declassification decision to the implementor. 
In the case of untrusted code, there is no guarantee that 
the output of the computation does not reveal sensitive 
information about the inputs.

In this paper, we present Airavat,¹ a system for distributed
computations which provides end-to-end confidentiality,
integrity, and privacy guarantees using a combination
of mandatory access control and differential privacy.
Airavat is based on the popular MapReduce framework,
thus its interface and programming model are already
familiar to developers. Differential privacy is a new
methodology for ensuring that the output of aggregate 
computations does not violate the privacy of individual 
inputs [11]. It provides a mathematically rigorous basis 
for declassifying data in a mandatory access control sys- 
tem. Differential privacy mechanisms add some random 
noise to the output of a computation, usually with only a 
minor impact on the computation’s accuracy. 


Our contributions. We describe the design and imple- 
mentation of Airavat. Airavat enables the execution of 
trusted and untrusted MapReduce computations on sen- 
sitive data, while assuring comprehensive enforcement 
of data providers’ privacy policies. To prevent infor- 
mation leaks through system resources, Airavat runs on 
SELinux [37] and adds SELinux-like mandatory access 
control to the MapReduce distributed file system. To pre- 
vent leaks through the output of the computation, Aira- 
vat enforces differential privacy using modifications to 
the Java Virtual Machine and the MapReduce framework. 
Access control and differential privacy are synergistic: if 
a MapReduce computation is differentially private, the 
security level of its result can be safely reduced. 

To show the practicality of Airavat, we carry out sev- 
eral substantial case studies. These focus on privacy- 
preserving data-mining and data-analysis algorithms, 
such as clustering and classification. The Airavat proto- 
type for these experiments is based on the Hadoop frame- 
work [2], executing in Amazon’s EC2 compute cloud en- 
vironment. In our experiments, Airavat produced accu- 
rate, yet privacy-preserving answers with runtimes within 
32% of conventional MapReduce. 

Airavat provides a practical basis for secure, privacy- 
preserving, large-scale, distributed computations. Poten- 
tial applications include a wide variety of cloud-based 
computing services with provable privacy guarantees, in- 
cluding genomic analysis, outsourced data mining, and 
clickstream-based advertising. 


2 System overview 


Airavat enables the execution of potentially untrusted 
data-mining and data-analysis code on sensitive data. Its 
objective is to accurately compute general or aggregate 
features of the input dataset without leaking information 
about specific data items. 


¹ The all-powerful king elephant in Indian mythology, known as the
elephant of the clouds.

Figure 1: High-level architecture of Airavat. (Diagram labels: Computation Provider; Untrusted Mapper + Mapper range; Data Providers.)


As a motivating scenario, consider an online retailer, 
BigShop, which holds a large database of customer trans- 
actions. For now, assume that all records in the database 
have the form (customer, order, date), with only one 
record per customer. A machine learning expert, Bob, 
pays BigShop to mine the data for certain transaction pat- 
terns. BigShop loads the data into the Hadoop framework 
and Bob writes the MapReduce code to analyze it. 

Such computations are commonly used for targeted ad- 
vertising and customer relationship management, but we 
will keep the example simple for clarity and assume that 
Bob wants to find the total number of orders placed on a 
particular date D. He writes a mapper that looks at each 
record and emits the key/value pair (K, order) if the date 
on the record is D. Here, K is a string constant. The 
reducer simply sums up the values associated with each 
key K and outputs the result. 
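
Bob's computation can be sketched in a few lines. This is a minimal single-process illustration, not Hadoop code: the `run` driver stands in for the framework's map, shuffle, and reduce phases, the `order` field is assumed to be a numeric count, and the names `K`, `mapper`, `reducer`, and `run` as well as the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator

Record = tuple[str, int, str]  # (customer, order, date); order assumed numeric

K = "orders"  # the string constant used as the single output key


def mapper(record: Record, target_date: str) -> Iterator[tuple[str, int]]:
    """Emit (K, order) only for records whose date equals target_date."""
    _customer, order, date = record
    if date == target_date:
        yield (K, order)


def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Sum all values associated with one key."""
    return (key, sum(values))


def run(records: Iterable[Record], target_date: str) -> dict[str, int]:
    """Tiny stand-in for the map, shuffle, and reduce phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record, target_date):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


records = [
    ("alice", 2, "2010-04-28"),
    ("bob", 1, "2010-04-29"),
    ("carol", 3, "2010-04-28"),
]
totals = run(records, "2010-04-28")
```

Because the mapper sees every raw record, nothing in this plain version stops it from writing records elsewhere, which is the risk the following paragraphs describe.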

The main risk for BigShop in this scenario is the fact 
that Bob’s code is untrusted and can therefore be uninten- 
tionally buggy or even actively malicious. Because Bob’s 
mapper has direct access to BigShop’s proprietary trans- 
action records, it can store parts of these data in a file 
which will be later accessed by Bob, or it can send them 
over the network. Such a leak would put BigShop at a 
commercial disadvantage and may also present a serious 
reputational risk if individual BigShop transactions were 
made public without the consent of the customer. 

The output of the computation may also leak informa- 
tion. For example, Bob’s mapper may signal the presence 
(or absence) of a certain customer in the input dataset by 
manipulating the order count for a particular day: if the 
record of this customer is in the dataset, the mapper out- 
puts an order count of 1 million; otherwise, it outputs 
zero. Clearly, the result of the computation in this case 
violates the privacy of the customer in question. 


2.1 Architecture of Airavat 


The three main entities in our model are (1) the data 
provider (BigShop, in our motivating example), (2) the 
computation provider (Bob, sometimes referred to as a 
user making a query), and (3) the computation framework
(Airavat). We aim to prevent malicious computation
providers from violating the privacy policy of the
data provider(s) by leaking information about individual 
data items. 

Computation providers write their code in the famil- 
iar MapReduce paradigm, while data providers specify 
the parameters of their privacy policies. We relieve data 
providers of the need to audit computation providers’ 
code for privacy compliance. 

Figure 1 gives an overview of the Airavat architec- 
ture. Airavat consists of modifications to the MapRe- 
duce framework, the distributed file system, and the Java 
virtual machine with SELinux as the underlying operat- 
ing system. Airavat uses SELinux’s mandatory access 
control to ensure that untrusted code does not leak in- 
formation via system resources, including network con- 
nections, pipes, or other storage channels such as names 
of running processes. To prevent information leakage 
through the output of the computation, Airavat relies on 
a differential privacy mechanism [11]. 

Data providers put access control labels on their data 
and upload them to Airavat. Airavat ensures that the re- 
sult of a computation is labeled with the union of all input 
labels. A data provider, D, can set the declassify flag (DF 
in Table 1) to true if he wants Airavat to remove his label 
from the output when it is safe to do so. If the flag is set, 
Airavat removes D’s label if and only if the computation 
is differentially private. Data providers must also create 
a privacy policy by setting the value of several privacy 
parameters (explained in Section 4). 
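
The labeling rule in this paragraph can be sketched as follows. This is an illustrative model only, not Airavat's SELinux-based implementation: the `Provider` class, the `result_labels` function, and the label strings are hypothetical, and each provider is assumed to use a distinct label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    label: str        # access control label placed on this provider's data
    declassify: bool  # the declassify flag (DF)


def result_labels(providers: list[Provider], is_differentially_private: bool) -> set[str]:
    """Start from the union of all input labels, then drop removable ones."""
    labels = {p.label for p in providers}
    for p in providers:
        # A label is removed if and only if its provider set DF and the
        # computation is differentially private.
        if p.declassify and is_differentially_private:
            labels.discard(p.label)
    return labels


providers = [
    Provider("BigShop", "bigshop_secret", declassify=True),
    Provider("OtherShop", "othershop_secret", declassify=False),
]
```

With these inputs, a differentially private computation keeps only the label of the provider that did not set DF, while a non-private computation keeps both labels.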

The computation provider must write his code in the 
Airavat programming model, which is close to standard 
MapReduce. The sensitivity of the function being com- 
puted determines the amount of perturbation that will be 
applied to the output to ensure differential privacy (§ 4).
Therefore, in Airavat the computation provider must sup- 
ply an upper bound on the sensitivity of his computa- 
tion by specifying the range of possible values that his 
mapper code may output. Airavat then ensures that the 
code never outputs values outside the declared range and 
perturbs those within the range so as to ensure privacy 
(§ 5.1). If malicious or incorrect code tries to output a 
value outside its declared range, the enforcement mecha- 
nism guarantees that no privacy breach will occur, but the 
results of the computation may no longer be accurate. 
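
A minimal sketch of range enforcement follows. The text only states that outputs never leave the declared range; clamping to the nearest bound is an assumption made here for illustration, and `enforce_range`, `malicious_mapper`, and the range (0, 10) are hypothetical.

```python
def enforce_range(value: float, m_min: float, m_max: float) -> float:
    """Force a mapper output into the declared range [m_min, m_max]."""
    if m_min > m_max:
        raise ValueError("declared range is empty")
    # Assumption: out-of-range values are clamped to the nearest bound.
    return min(max(value, m_min), m_max)


def malicious_mapper(record_has_target: bool) -> float:
    """Hypothetical attack: signal a customer's presence with a huge count."""
    return 1_000_000 if record_has_target else 0


M_MIN, M_MAX = 0, 10  # hypothetical declared mapper range
confined = enforce_range(malicious_mapper(True), M_MIN, M_MAX)
```

The confined value no longer carries a million-order signal, so the noise calibrated to the declared range remains sufficient, although the final count is no longer meaningful.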

Apart from specifying the parameters mentioned 
above, neither the data provider, nor the computation 
provider needs to understand the intricacies of differen- 
tial privacy and its enforcement. 


2.2 Trusted computing base of Airavat


Airavat trusts the cloud provider and the cloud-computing
infrastructure. It assumes that SELinux correctly
implements MAC and relies on the MAC features
added to the MapReduce distributed file system, as well
as on the (modified) Java virtual machine to enforce certain
properties of the untrusted mapper code (see Section 5.3).
Airavat includes trusted implementations of
several reducer functions.

Participant                 Parameter or component

Data provider               Labeled data (DB)
                            Declassify flag (DF)
                            Privacy parameters (ε, δ)
                            Privacy budget (P_B)
                            Code to determine privacy groups* (PG)

Computation provider        Mapper range (M_min, M_max)
(user making a query)       Independent mapper code (Map)
                            Number of outputs (N)
                            Code to determine partitions* (PC)
                            Map of partition to output* (PM)
                            Max keys output for any privacy group* (n)

Airavat                     Trusted reducer code (Red)
                            Modified MapReduce (MMR)
                            Modified distributed file system (MDFS)
                            SELinux policy (SE)

Table 1: Parameters and components provided by different participants.
Optional parameters are starred.

We assume that the adversary is a malicious computa- 
tion provider who has full control over the mapper code 
supplied to Airavat. The adversary may attempt to ac- 
cess the input, intermediate, and output files created by 
this code, or to reconstruct the values of individual inputs 
from the result of the computation. 


2.3 Limitations of Airavat 


Airavat cannot confine every computation performed by 
untrusted code. For example, a MapReduce computa- 
tion may output key/value pairs. Keys are text strings 
that provide a storage channel for malicious mappers. In 
general, Airavat cannot guarantee privacy for computa- 
tions which output keys produced by untrusted mappers. 
In many cases, privacy can be achieved by requiring the 
computation provider to declare the key in advance and 
then using Airavat to compute the corresponding value in 
a differentially private way. 

MapReduce computations that necessarily output keys 
require trusted mappers. For example, printing the top 
K items sold in a store involves printing item names.
Because a malicious mapper can use a name to encode 
information about individual inputs, this computation re- 
quires trusted mappers. By contrast, the number of iPods 
sold can be calculated using an untrusted mapper because 
the key (“iPod” in this case) is known prior to the answer
being released. (See Section 5 for details.) 


3 MapReduce and MAC 


Table 1 lists the components of the system contributed by 
the data provider(s), computation provider(s), and Airavat.
The following discussion explains entries in the table
in the context of MapReduce computations, mandatory
access control, or differential privacy. We place a bold 
label in the text to indicate that the discussion is about a 
particular row in the table. 


3.1 MapReduce 


MapReduce [9] is a framework for performing data- 
intensive computations in parallel on commodity com- 
puters. A MapReduce computation reads input files from 
a distributed file system which splits the file into multi- 
ple chunks. Each chunk is assigned to a mapper which 
reads the data, performs some computation, and outputs 
a list of key/value pairs. In the next phase, reducers com- 
bine the values belonging to each distinct key according 
to some function and write the result into an output file. 
The framework ensures fault-tolerant execution of map- 
pers and reducers while scheduling them in parallel on 
any machine (node) in the system. In MapReduce, com- 
biners are an optional processing stage before the reduce 
phase. They are a performance optimization, so for sim- 
plicity, we defer them to future work. Airavat secures the 
execution of untrusted mappers (Map) using a MAC OS 
(SE), as well as modifications to the MapReduce framework
(MMR) and distributed file system (MDFS).


3.2 Mandatory access control 


Mandatory access control (MAC) assigns security at- 
tributes to system resources and uses these attributes 
to constrain the interaction of subjects (e.g., processes) 
with objects (e.g., files). In contrast to discretionary ac- 
cess control (e.g., UNIX permissions), MAC systems (1) 
check permissions on every operation and transitively en- 
force access restrictions (e.g., processes that access se- 
cret data cannot write non-secret files) and (2) enforce 
access rules specified by the system administrator at all 
times, without user override. MAC systems include 
mainstream implementations such as SELinux [37] and 
AppArmor [1] which appear in Linux distributions, as 
well as research prototypes [45, 49, 51] which imple- 
ment a MAC security model called decentralized infor- 
mation flow control (DIFC). Our current implementation 
uses SELinux because it is a mature system that provides 
sufficient functionality to enforce Airavat’s security policies.

SELinux divides subjects and objects into groups 
called domains or types. The domain is part of the se- 
curity attribute of system resources. A domain can be 
thought of as a sandbox which constrains the permissions 
of the process. For example, the system administrator 
may specify that a given domain can only access files be- 
longing to certain domains. In SELinux, one can specify 
rules that govern transition from one domain to another. 
Generally, a transition occurs by executing a program de- 
clared as the entry point for a domain. 

In SELinux, users are assigned roles. A role governs 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 


the permissions granted to the user by determining which 
domains he can access. For example, the system administrator
role (sysadm_r) has permissions to access the
ifconfig_t domain and can perform operations on the 
network interface. In SELinux, access decisions are de- 
clared in a policy file which is customized and config- 
ured by the system administrator. The Airavat-specific 
SELinux policy to enforce mandatory access control and 
declassification (SE, DF) is described in Section 6. 


4 Differential privacy 


The objective of Airavat is to enable large-scale compu- 
tation on data items that originate from different sources 
and belong to different owners. The fundamental ques- 
tion of what it means for a computation to preserve the 
privacy of its inputs has been the subject of much research 
(see Section 9). 

Airavat uses the recently developed framework of dif- 
ferential privacy [11, 12, 13, 14] to answer this question. 
Intuitively, a computation on a set of inputs is differen- 
tially private if, for any possible input item, the proba- 
bility that the computation produces a given output does 
not depend much on whether this item is included in the 
input dataset or not. Formally, a computation F satisfies
(ε, δ)-differential privacy [15] (where ε and δ are privacy
parameters) if, for all datasets D and D’ whose only difference
is a single item which is present in D but not D’,
and for all outputs S ⊆ Range(F),

\[
\Pr[F(D) \in S] \le \exp(\epsilon) \times \Pr[F(D') \in S] + \delta
\]


Another intuitive way to understand this definition is as 
follows. Given the output of the computation, one cannot 
tell if any specific data item was used as part of the input 
because the probability of producing this output would 
have been the same even without that item. Not being 
able to tell whether the item was used at all in the com- 
putation precludes learning any useful information about 
it from the computation’s output alone. 

The computation F must be randomized to achieve 
privacy (probability in the above definition is taken over 
the randomness of F). Deterministic computations are 
made privacy-preserving by adding random noise to their 
outputs. The privacy parameter ε controls the tradeoff between
the accuracy of the output and the probability that
it leaks information about any individual input. 

The purpose of the δ parameter is to relax the multiplicative
definition of privacy for certain kinds of computation.
Consider TOPWORDS, which calculates the frequency
of words in a corpus and outputs the top 10 words.
Let D and D’ be two large corpora; the only difference
is that D contains a single instance of the word
“sesquipedalophobia,” while D’ does not. The probability
that TOPWORDS outputs “sesquipedalophobia” is
very small on input D and zero on input D’. The multiplicative
bound on the ratio between these probabilities
required by differential privacy cannot be achieved (since
one of the probabilities is zero), but the absolute difference
is very small. The purpose of δ in the definition is
to allow a small absolute difference in probabilities. In
many of the computations considered in this paper, this
situation does not arise and δ can be safely set to 0.

In Section 9, we discuss why differential privacy is the 
“right” concept of privacy for cloud computing. The most 
important feature of differential privacy is that it does not 
make any assumptions about the adversary. When satisfied,
it holds regardless of the auxiliary or prior knowledge
that the adversary may possess. Furthermore, differential
privacy is composable: a composition of two
differentially private computations is also differentially
private (of course, ε and δ may increase).
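
As a worked illustration of how the parameters grow, the standard basic composition bound (a well-known result that the text above does not spell out) simply adds them. The symbols F_1, F_2 and the numeric values are chosen here for illustration only.

```latex
\[
F_1 \text{ is } (\epsilon_1, \delta_1)\text{-differentially private},\quad
F_2 \text{ is } (\epsilon_2, \delta_2)\text{-differentially private}
\;\Longrightarrow\;
(F_1, F_2) \text{ is } (\epsilon_1 + \epsilon_2,\ \delta_1 + \delta_2)\text{-differentially private}.
\]
\[
\text{Example: } \epsilon_1 = \epsilon_2 = 0.1,\ \delta_1 = \delta_2 = 0
\;\Longrightarrow\; \epsilon = 0.2,\ \delta = 0 .
\]
```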

There are many mechanisms for achieving differential 
privacy [5, 14, 17, 40]. In this paper, we will use the 
mechanism that adds Laplacian noise to the output of a 
computation f : D → R^k:

\[
f(x) + \left(\mathrm{Lap}(\Delta f / \epsilon)\right)^k
\]

where Lap(Δf/ε) is a symmetric exponential distribution
with standard deviation √2 Δf/ε.
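
The mechanism can be sketched for the single-output case (k = 1) using only the Python standard library. This is an illustrative sketch, not Airavat's implementation: the function names are hypothetical, and the Laplace sample is drawn as the difference of two independent exponential variates, which is one standard way to generate it.

```python
import math
import random


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace draw: difference of two i.i.d. exponentials."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noise_std(sensitivity: float, epsilon: float) -> float:
    """Standard deviation of Lap(sensitivity / epsilon): sqrt(2) * scale."""
    return math.sqrt(2.0) * sensitivity / epsilon


def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_sample(sensitivity / epsilon, rng)


rng = random.Random(2010)  # fixed seed only to make the sketch repeatable
noisy_count = laplace_mechanism(true_value=128.0, sensitivity=1.0,
                                epsilon=0.5, rng=rng)
```

For k outputs, the same noise would be drawn independently for each coordinate. A smaller ε or a larger Δf widens the noise, trading accuracy for privacy.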


Privacy groups. To provide privacy guarantees which 
are meaningful to users, it is sometimes important to
consider input datasets that differ not just on a single 
record, but on a group of records (PG). For example, 
when searching for a string within a set of documents, 
each input might be a line from a document, but the pri- 
vacy guarantee should apply to whole documents. Differ- 
ential privacy extends to privacy groups via composabil- 
ity: the effect of n input items on the output is at most n 
times the effect of a single item. 


4.1 Function sensitivity 


A function’s sensitivity measures the maximum change 
in the function’s output when any single item is removed 
from or added to its input dataset. Intuitively, the more 
sensitive a function, the more information it leaks about
the presence or absence of a particular input. Therefore, 
more sensitive functions require the addition of more ran- 
dom noise to their output to achieve differential privacy. 
Formally, the sensitivity of a function f : D — R* is 


/ 
ALT= mes |) FP 
for any D, D’ that are identical except for a single ele- 
ment, which is present in D, but not in D’. In this paper, 
we will be primarily interested in functions that produce 
a single output, i.e., k = 1. 
Many common functions have low sensitivity. For example,
a function that counts the number of elements satisfying
a certain predicate has sensitivity 1 (because the
count can change by at most 1 when any single element is
removed from the dataset). The sensitivity of a function
that sums up integers from a bounded range is the maximum
value in that range. Malicious functions that aim
to leak information about an individual input or signal its
presence in the input dataset are likely to be sensitive because
their output must necessarily differentiate between
the datasets in which this input is present and those in
which it is not present.
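
To make the counting and summing examples concrete, the sketch below measures how much a function's output changes when any one element is removed from a particular dataset. The helper `removal_sensitivity` and the sample data are hypothetical. Note that this measures the change on one dataset only, so it is a lower bound on the sensitivity defined above, which maximizes over all datasets.

```python
from typing import Callable, Sequence


def removal_sensitivity(f: Callable[[Sequence[int]], float],
                        data: Sequence[int]) -> float:
    """Largest change in f(data) when any single element is removed.

    Assumes data is non-empty.
    """
    full = f(data)
    return max(
        abs(full - f(list(data[:i]) + list(data[i + 1:])))
        for i in range(len(data))
    )


def count_even(xs: Sequence[int]) -> int:
    """Count the elements satisfying a predicate (here: being even)."""
    return sum(1 for x in xs if x % 2 == 0)


data = [3, 8, 0, 10, 7]          # integers from the bounded range 0..10
count_sens = removal_sensitivity(count_even, data)  # removing an even element changes the count by 1
sum_sens = removal_sensitivity(sum, data)           # removing the element 10 changes the sum by 10
```
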

In general, estimating the sensitivity of arbitrary un- 
trusted code is difficult. Therefore, we require the com- 
putation provider to furnish the range of possible outputs 
for his mappers and use this range to derive estimated 
sensitivity. Estimated sensitivity is then used to add suf- 
ficient random noise to the output and guarantee privacy 
regardless of what the untrusted code does. If the code 
is malicious and attempts to output values outside its declared
range, the enforcement mechanism will choose a
value within the range. The computation still guarantees
privacy, but the results may no longer be accurate (§ 5.1).


Sensitivity of SUM. Consider a use of SUM that takes
as input 100 integers and returns their sum. If we know in
advance that the inputs are all 0 or 1, then the sensitivity
of SUM is low because the result varies at most by 1 depending
on the presence of any given input. Only a little
noise needs to be added to the sum to achieve privacy.

In general, the sensitivity of SUM is determined by the 
largest possible input. In this example, if one input could 
be as big as 1,000 and the rest are all 0 or 1, the proba- 
bility of outputting any given sum should be almost the 
same with or without 1,000. Even if all actual inputs are 
0 or 1, a lot of noise must be added to the output of SUM 
in order to hide whether 1,000 was among the inputs. 

Differential privacy works best for low-sensitivity 
computations, where the maximum influence any given 
input can have on the output of the computation is low. 


4.2 Privacy budget 


Data providers may want an absolute privacy guarantee 
that holds regardless of the number and nature of compu- 
tations carried out on the data. Unfortunately, an abso- 
lute privacy guarantee cannot be achieved for meaningful 
definitions of privacy. A fundamental result by Dinur and 
Nissim [10] shows that the entire dataset can be decoded 
with a linear (in the size of the dataset) number of queries. 
This is a serious, but inevitable, limitation. Existing pri- 
vacy mechanisms which are not based on differential pri- 
vacy either severely limit the utility of the data, or are 
only secure against very restricted adversaries (see [13] 
and Section 9). 

The composability of differential privacy and the need
to restrict the number of queries naturally give rise to the
concept of a “privacy budget” (P_B) [17, 38]. Each differentially
private computation with a privacy parameter of ε
results in subtracting ε from this budget. Once the privacy
budget is exhausted, results can no longer be automatically
declassified. The need to pre-specify a limit on how
much computation can be done over a given dataset constrains
some usage scenarios. We emphasize that there
are no definitions of privacy that are robust, composable,
and achievable in practice without such a limit.

After the privacy budget has been exhausted, Airavat 
still provides useful functionality. While the output can 
no longer be automatically declassified without risking 
a privacy violation, Airavat still enforces access control 
restrictions on the untrusted code and associates proper 
access control labels with the output. In this case, out- 
puts are no longer public and privacy protection is based 
solely on mandatory access control. 
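
The budget accounting described in this section can be sketched as follows. This is a minimal illustration under our own naming (`PrivacyBudget`, `try_charge`), not Airavat's code: each computation is charged ε per output, and once the remaining budget cannot cover a charge, the result is no longer eligible for automatic declassification.

```python
class PrivacyBudget:
    """Tracks the remaining privacy budget for one dataset."""

    def __init__(self, total: float) -> None:
        self.remaining = total

    def try_charge(self, epsilon: float, num_outputs: int = 1) -> bool:
        """Deduct epsilon * num_outputs if affordable; report success."""
        cost = epsilon * num_outputs
        if cost > self.remaining:
            return False          # budget exhausted: do not declassify
        self.remaining -= cost
        return True


budget = PrivacyBudget(1.0)
charges = [budget.try_charge(0.25) for _ in range(4)]  # four queries fit exactly
fifth = budget.try_charge(0.25)                        # the fifth is refused
```

When `try_charge` returns False, a system in the spirit of Airavat would still run the computation but keep the output under its access control label rather than releasing it.
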


5 Enforcing differential privacy 


Airavat supports both trusted and untrusted mappers. Be- 
cause reducers are responsible for enforcing privacy, they 
must be trusted. The computation provider selects a re- 
ducer from a small set included in the system. 

The outputs of mappers and reducers are lists of key/value
pairs. An untrusted, potentially malicious mapper
may try to leak information about an individual input by 
encoding it in (1) the values it outputs, (2) the keys it out- 
puts, (3) the order in which it outputs key/value pairs, or 
(4) relationships between output values of different keys. 

MapReduce keys are arbitrary strings. Airavat cannot 
determine whether a key encodes sensitive information. 
The mere presence of a particular key in the output may 
signal information about an individual input. Therefore, 
Airavat never outputs any keys produced by untrusted 
mappers. Instead, the computation provider submits a 
key or list of keys as part of the query and Airavat returns 
(noisy) values associated with these keys. As explained 
below, Airavat always returns a value for every key in 
the query, even if none of the mappers produced this key. 
This prevents untrusted mappers from signaling informa- 
tion by adding or removing keys from their output. 

For example, Airavat can be used to compute the noisy 
answer to the query “What is the total number of iPods 
and pens sold today?” (see the example in Section 5.4) 
because the two keys iPod and pen are declared as 
part of the computation. The query “List all items and
their sales” is not allowed in Airavat, unless the mapper
is trusted. The reason is that a malicious mapper can leak
information by encoding it in item names.

Trusted Airavat reducers always sort keys prior to out- 
putting them. Therefore, a malicious mapper cannot use 
key order as a channel to leak information about a partic- 
ular input record. 
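
The handling of keys described above can be sketched as follows. The function name and the sample data are hypothetical, and the zero-noise stub is used only to make the behavior visible: keys that were not declared in the query are dropped, every declared key receives a value even if no mapper produced it, and keys are emitted in sorted order.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def answer_declared_keys(mapper_totals: Dict[str, float],
                         declared_keys: Iterable[str],
                         noise: Callable[[], float]) -> List[Tuple[str, float]]:
    """Return one noisy value per declared key, in sorted key order."""
    return [
        (key, mapper_totals.get(key, 0.0) + noise())
        for key in sorted(declared_keys)
    ]


# A malicious mapper tries to signal through an extra key; it never appears.
totals = {"iPod": 3.0, "alice-was-here": 1.0}
answer = answer_declared_keys(totals, ["pen", "iPod"], noise=lambda: 0.0)
```
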

A malicious mapper may attempt to encode information
by emitting a certain combination of values associated
with different keys. As explained below, trusted reducers
use the declared output range of mappers to add
sufficient noise to ensure differential privacy for the outputs.
In particular, a combination C of output values
across multiple keys does not leak information about any
given input record r because the probability of Airavat
producing C is approximately the same with or without r
in the input dataset.

In the rest of this section, we explain how Airavat en- 
forces differential privacy for computations involving un- 
trusted mappers. We use BigShop from Section 2 as our 
running example. We also briefly describe a broader class 
of differentially private computations which can be im- 
plemented using trusted mappers. 


5.1 Range declarations and estimated sensitivity 


Airavat reducers enforce differential privacy by adding 
exponentially distributed noise to the output of the com- 
putation. The sensitivity of the computation determines 
the amount of noise: the noise must be sufficient to mask 
the maximum influence that any single input record can 
have on the output (§ 4.1).

In the case of untrusted mappers, the function(s)
they compute and their sensitivity are unknown. To
help Airavat estimate sensitivity, we require the computation
provider to declare the range of output values
(M_min, M_max) that his mapper can produce. Airavat
combines this range with the sensitivity of the function
implemented by the trusted reducer (Red) into estimated
sensitivity. For example, estimated sensitivity of the SUM
reducer is max(|M_max|, |M_min|), because any single input
can change the output by at most this amount.

The declared mapper range can be greater or smaller
than the true global sensitivity of the function computed
by the mapper. While global sensitivity measures the
output difference between any two inputs that differ in
at most one element (§ 4.1), the mapper range captures
the difference between any two inputs. That said, the
computation provider may assume that all inputs for the
current computation lie in a certain subset of the function’s
domain, so the declared range may be lower than
the global sensitivity. In our clustering case study (§ 8.4),
such an assumption allows us to obtain accurate results
even though global sensitivity of clustering is very high
(on “bad” input datasets, a single point can significantly
change the output of the clustering algorithms).

The random noise added by Airavat to the output
of MapReduce computations is a function of the data
provider’s privacy parameter ε and the estimated sensitivity.
For example, Airavat’s SUM reducer adds
noise from the Laplace distribution, Lap(b/ε), where b =
max(|M_max|, |M_min|).





Example. In the BigShop example, Bob writes his own
mapper and uses the SUM reducer to compute the total
number of orders placed on date D. Assuming that a customer
can order at most 25 items on any single day, Bob
declares his mapper range as (0, 25). The estimated sensitivity
is 25 because the presence or absence of a record
can affect the order total by at most 25.


Privacy groups. In the BigShop example, we may 
want to provide privacy guarantees at the level of cus- 
tomers rather than records (a single customer may have 
multiple records). Airavat supports privacy groups (§ 4),
which are collections of records that are jointly present
or absent in the dataset. The data provider supplies a program
(PG) that takes a record as input and emits the corresponding
group identifier, gid. Airavat attaches these
identifiers to key/value pairs to track the dispersal of information
from each input privacy group through intermediate
keys to the output. The mapper range declared
by the computation provider is interpreted at the group 
level. For example, suppose that each BigShop record 
represents a purchase, a customer can make at most 10 
purchases a day, and each purchase contains at most 25 
orders. If all orders of a single customer are viewed as a 
privacy group, then the mapper range is (0, 250). 
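
The arithmetic in the two BigShop examples can be checked with a short sketch. The function names are ours; the sketch assumes, as in the text, that the estimated sensitivity of SUM is max(|M_max|, |M_min|) and that a range declared at the group level scales with the maximum number of records per group.

```python
from typing import Tuple


def estimated_sum_sensitivity(m_min: float, m_max: float) -> float:
    """Estimated sensitivity of the SUM reducer for a declared range."""
    return max(abs(m_min), abs(m_max))


def group_level_range(per_record_max: float,
                      max_records_per_group: int) -> Tuple[float, float]:
    """Range of a whole privacy group when each record lies in (0, per_record_max)."""
    return (0.0, per_record_max * max_records_per_group)


record_level = estimated_sum_sensitivity(0, 25)      # one record: at most 25 orders
customer_range = group_level_range(25, 10)           # 10 purchases of at most 25 orders
customer_level = estimated_sum_sensitivity(*customer_range)
```

Protecting whole customers rather than single records therefore costs a tenfold increase in the noise scale in this example.
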


5.2 Range enforcement


To prevent malicious mappers from leaking information 
about inputs through their output values, Airavat asso- 
ciates a range enforcer with each mapper. The range en- 
forcer checks that the value in each key/value pair output 
by the mapper lies within its declared range. This check 
guarantees that the actual sensitivity of the computation 
performed by the mapper does not exceed the estimated 
sensitivity, which is based on the declared range. If a
malicious mapper outputs a value outside the range, the
enforcer replaces it with a value inside the range. In this
case, differential privacy holds, but the computation
may no longer produce accurate or meaningful results.

Range enforcement in Airavat prioritizes privacy over 
accuracy. If the computation provider declares the range 
incorrectly, the computation remains differentially pri- 
vate. However, the results are not meaningful and the 
provider gets no feedback about the problem, because 
any such feedback would be an information leak. The 
lack of feedback may seem unsatisfying, but other sys- 
tems that tightly regulate information flow make similar 
tradeoffs. For example, MAC systems Flume and As- 
bestos make pipes (used for interprocess communication) 
unreliable and do not give the user any feedback about 
their failure because such feedback would leak informa- 
tion [29, 49]. 

Providing a mapper range is simple for some computations.
For example, Netflix movie ratings (§ 8.3) are always
between 1 and 5. When computing the word count
of a set of documents, however, estimating the mapper
range is more difficult. If each document is at most N
words, and the document is a privacy group, then the
0–N range will guarantee privacy of individual documents.
Depending on the number of documents, such
a large range may result in adding excessive noise. For
some domains, it might not be possible to obtain a reasonable
estimate of the mapper’s range. Airavat gives
accurate results only when the computation provider understands
the sensitivity of his computation.


Figure 2: Simplified overview of range enforcers and noise
generation. Trusted components are shaded.

In the BigShop example, the range enforcer ensures
that in every (K, V) pair output by the mapper, 0 ≤ V ≤
25. Suppose a malicious mapper attempts to leak information
by outputting 1,000 when Alice’s record is in the
input dataset and 0 otherwise. Because 1,000 is outside
the declared range, the range enforcer will replace it with,
say, 12.5. The difference between 0 and 12.5 is less than
the estimated sensitivity. Therefore, enough noise will
be added so that one cannot tell, by looking at the output,
whether this output was obtained by adding noise to
12.5 or 0. The noisy output thus does not reveal whether
Alice’s record was present in the input dataset or not.
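
The enforcer's behavior in this example can be sketched in a few lines. The text only requires that the replacement be some value inside the declared range; choosing the midpoint is an assumption of this sketch that happens to reproduce the 12.5 used above. The function name is hypothetical.

```python
def enforce_range(value: float, m_min: float, m_max: float) -> float:
    """Return value unchanged if it lies in [m_min, m_max], else the midpoint."""
    if m_min <= value <= m_max:
        return value
    # Also reached for NaN, because every comparison with NaN is False.
    return (m_min + m_max) / 2.0


honest = enforce_range(7, 0, 25)         # in range: passes through
leak_attempt = enforce_range(1000, 0, 25)  # out of range: replaced
garbage = enforce_range(float("nan"), 0, 25)
```
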


Distributed range enforcement. A single MapReduce 
operation may execute mappers on many different ma- 
chines. These mappers may process input elements with 
the same key or privacy group. Airavat associates a range 
enforcer with each mapper and merges their states at the 
end of the map phase. After merging, Airavat ensures that 
the values corresponding to each key or privacy group are 
within the declared range (see Figure 2). 


Example: “noisy sum.” Figure 3 illustrates differen- 
tial privacy enforcement with an untrusted mapper and 
the SUM reducer. This “noisy sum” primitive was shown 
by Blum et al. [6] to be sufficient for privacy-preserving 
computation of all algorithms in the statistical query 
model [27], including k-Means, Naive Bayes, principal 
component analysis, and linear classifiers such as percep- 
trons (for a slightly different definition of privacy). 

Each input record is its own privacy group. The com- 
putation provider supplies the implementation of the ac- 
tual mapper function Map, which converts every input 
record into a list of key/value pairs.


    // Inputs and definitions
    Data owner: DB, ε, δ = 0, P_B
    Computation provider: Map, M_min, M_max, N
    Airavat: SUM (trusted reducer, Red)
    b = max(|M_max|, |M_min|)     // estimated sensitivity
    μ = (M_max + M_min)/2         // replacement value inside the range

    // Map phase
    if (P_B − ε × N < 0) {
        print "Privacy limit exceeded";
        TERMINATE
    }
    P_B = P_B − ε × N
    for (Record r in DB) {
        (k_0, v_0, ..., k_n, v_n) = Map(r)
        for (i: 0 to n) {
            if (v_i < M_min or v_i > M_max) {
                v_i = μ           // range enforcement
            }
        }
        emit (k_0, v_0, ..., k_n, v_n)
    }

    // Reduce phase
    count = N
    Reduce(Key k, List val) {
        if (--count < 0) { Skip } // at most N outputs
        V = SUM(val)
        print V + Lap(b/ε)
    }
    for (i: count down to 1) {    // pad up to N outputs
        print Lap(b/ε)
    }

Figure 3: Simplified pseudo-code demonstrating differential
privacy enforcement.
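
For readers who prefer something executable, the following Python sketch mirrors the structure of Figure 3 under several assumptions of ours: the output keys are declared up front (so N is the number of declared keys), out-of-range values are replaced by the midpoint of the range, and noise is drawn with Python's non-cryptographic `random` module. The function `noisy_sum` and its parameters are hypothetical names, not part of Airavat.

```python
import random
from collections import defaultdict


def noisy_sum(db, mapper, m_min, m_max, epsilon, declared_keys, budget, rng):
    """Return ({key: noisy sum}, remaining budget) for the declared keys."""
    cost = epsilon * len(declared_keys)
    if budget - cost < 0:
        raise RuntimeError("Privacy limit exceeded")
    b = max(abs(m_min), abs(m_max))       # estimated sensitivity of SUM
    if b == 0:
        raise ValueError("declared range must not be (0, 0)")
    mid = (m_min + m_max) / 2.0           # in-range replacement (assumed midpoint)

    totals = defaultdict(float)
    for record in db:                     # map phase: one record at a time
        for key, value in mapper(record):
            if not (m_min <= value <= m_max):
                value = mid               # range enforcement
            totals[key] += value

    rate = epsilon / b                    # Laplace scale is b / epsilon

    def lap():
        # Difference of two exponentials is Laplace-distributed.
        return rng.expovariate(rate) - rng.expovariate(rate)

    # Reduce phase: exactly one value per declared key, in sorted order.
    result = {k: totals.get(k, 0.0) + lap() for k in sorted(declared_keys)}
    return result, budget - cost


rng = random.Random(1)
db = [1] * 100 + [10**6]                  # the last record tries to leak
out, left = noisy_sum(db, lambda r: [("orders", r)], 0, 25, 1.0,
                      ["orders", "refunds"], 10.0, rng)
```

In this run the out-of-range record contributes 12.5 rather than 10^6, and the undeclared-but-absent key "refunds" still receives a (pure noise) value.
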


5.3 Mapper independence

Airavat forces all invocations of a mapper in a given 
MapReduce computation to be independent. Only a sin- 
gle input record is allowed to affect the key/value pairs 
output by the mapper. The mapper may not store the 
key/value pair(s) produced from an input record and use 
them later, when computing the key/value pair for an- 
other record. Without this restriction, estimated sensitiv- 
ity used in privacy enforcement may be lower than the ac- 
tual sensitivity of the mapper, resulting in a potential pri- 
vacy violation. Mappers can only create additional keys 
for the same input record; they cannot merge information
contained in different input records. We ensure mapper
independence by modifying the JVM (§ 7.3).


Each mapper is permitted by the Airavat JVM to ini- 
tialize itself once by overriding the configure func- 
tion, called when the mapper is instantiated. To ensure 
independence, during initialization the mapper may not 
read any files written in this MapReduce computation.


5.4 Managing multiple outputs


A MapReduce computation may output more than one 
key/value pair (e.g., Figure 3). The computation provider 
must specify the number of output keys (N) beforehand; 
otherwise, the number of outputs can become a channel 
through which a malicious mapper can leak information 
about the inputs. If a computation produces more (fewer) 
than the declared number of outputs, then Airavat re- 
moves (creates) outputs to match the declared value. 
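
A minimal sketch of this output-count enforcement is shown below. It assumes, following the padding loop in Figure 3, that created outputs consist of noise alone; the function name and the zero-noise stub in the usage lines are ours.

```python
from typing import Callable, List, Sequence


def match_declared_count(values: Sequence[float], n: int,
                         noise: Callable[[], float]) -> List[float]:
    """Truncate or pad a list of (already noisy) outputs to exactly n entries."""
    kept = list(values[:n])                              # remove surplus outputs
    kept.extend(noise() for _ in range(n - len(kept)))   # create missing ones
    return kept


too_many = match_declared_count([1.0, 2.0, 3.0], 2, noise=lambda: 0.0)
too_few = match_declared_count([1.0], 3, noise=lambda: 0.0)
```
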

Range restrictions are enforced separately for each 
(privacy group, key) pair. Therefore, random noise is in- 
dependently added to all values associated with the final 
output keys. Recall that Airavat never outputs a key pro- 
duced by an untrusted mapper. Instead, the computation 
provider must specify a key or list of keys as part of the 
query, and Airavat will return the noisy values associated 
with each key in the query. For such queries, N can be 
calculated automatically. 

In general, each output represents a separate release 
of information about the same input. Therefore, Airavat 
must subtract more than one ε from the privacy budget
(see Figure 3). If different outputs are based on disjoint 
parts of the input, then smaller deductions from the pri- 
vacy budget are needed (see below). 


Example. Consider the BigShop example, where each 
record includes the customer’s name, a list of products, 
and the number of items bought for each product (e.g., 
[Joe, iPod, 1, pen, 10]). The privacy group is the cus- 
tomer, and each customer may have multiple records. 
Bob wants to compute the total number of iPods and pens 
sold. Bob must specify that he expects two outputs. If he 
specifies the keys for these outputs as part of the query 
(e.g., “iPod” and “pen”), then the keys will be printed.
Otherwise, only the values will be printed. Airavat subtracts
2ε from the privacy budget for this query.

Bob’s mapper, after reading a record, outputs the prod- 
uct name and the number of sold items (e.g., (iPod, 1), 
(pen, 10)—note that more than one key/value pair is out- 
put for each record). Bob also declares the mapper range 
for each key, e.g., (0,5) for the number of iPods bought 
and (0,25) for the number of pens. Airavat range en- 
forcers automatically group the values by customer name 
and enforce the declared range for each item count. The 
final reducer adds random noise to the total item counts. 


Computing on disjoint partitions of the input. When
different outputs depend on disjoint parts of the input, the 
MapReduce computation can be decomposed into inde- 
pendent, parallel computations on independent datasets, 
and smaller deductions from the privacy budget are suffi- 
cient to ensure differential privacy. To help Airavat track 
and enforce the partitioning of the input, the computation 
provider must (1) supply the code (PC) that assigns input 
records to disjoint partitions, and (2) specify which of the
declared outputs will be based on which partition (PM).

The PC code is executed as part of the initial map- 
per. For each key/value pair generated by a mapper, Aira- 
vat constructs records of the form (key, value, gid, pid), 
where gid is the privacy group identifier and pid is the 
partition identifier. 

The computation provider declares which partition
produces which of the N final outputs (PM). Airavat uses
PM to calculate p, the maximum number of final outputs
that depend on any single partition. If PC and PM are
not provided, Airavat sets p to equal N. Airavat charges
ε × min(N, p) to the privacy budget. For example,
a computation provider may partition the BigShop data
into two cities, Austin and Seattle, which act as the
partition identifiers. He then specifies that the MapReduce
computation will have 8 outputs, the first five of
which are calculated from the Austin partition and the
next three from the Seattle partition. In this example,
N = 8, p = 5, and to run the computation, Airavat will
subtract 5ε from the privacy budget. In Figure 3, ε × N
is charged to the privacy budget because the N outputs
depend on the entire input, not on disjoint partitions.
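
The charge for partitioned computations can be sketched as follows. Representing PM as a list that names the partition behind each declared output is an assumption of this sketch, as is the function name.

```python
from collections import Counter
from typing import Optional, Sequence


def partitioned_charge(epsilon: float, n_outputs: int,
                       output_partition: Optional[Sequence[str]] = None) -> float:
    """Budget charge epsilon * min(N, p) for N outputs.

    output_partition plays the role of PM: entry i names the partition that
    output i depends on. Without it, p defaults to N.
    """
    if output_partition is None:
        p = n_outputs
    else:
        if len(output_partition) != n_outputs:
            raise ValueError("PM must name one partition per declared output")
        # p: the largest number of outputs that depend on a single partition.
        p = max(Counter(output_partition).values(), default=0)
    return epsilon * min(n_outputs, p)


pm = ["Austin"] * 5 + ["Seattle"] * 3
with_partitions = partitioned_charge(1.0, 8, pm)   # N = 8, p = 5
without_partitions = partitioned_charge(1.0, 8)    # p defaults to N
```
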

Airavat enforces the partitioning declared by the com- 
putation provider. Trusted Airavat reducers use partition 
identifiers to ensure that only key/value pairs that have 
the correct pid are combined to generate the output. 
Airavat uses PM for computations on disjoint partitions 
in the same way as it uses N for unpartitioned data. If the 
number of outputs for a partition is less (more) than what 
is specified by PM, Airavat adds (deletes) outputs. 


5.5 Trusted reducers and reducer composition 


Trusted reducers such as SUM and COUNT are executed 
directly on the output of the mappers. The computation 
provider can combine these reducers with any untrusted 
mapper, and Airavat will ensure differential privacy for 
the reducer’s output. For example, to calculate the total 
number of products sold by BigShop, the mapper will be 
responsible for the parsing logic and manipulation of the 
data. COUNT is a special case of SUM where the output 
range is {0,1}. MEAN is computed by calculating the 
SUM and dividing it by the COUNT. 

Reducers can be composed sequentially.
THRESHOLD, K-COUNT, and K-SUM reducers are
most useful when applied to the output of another
reducer. THRESHOLD prints the outputs whose value is
more than C, where C is a parameter. K-COUNT counts
the number of records, and K-SUM sums the values
associated with each record. For example, to count the
number of distinct words occurring in a document, one
can first write a MapReduce computation to group the
words and then apply K-COUNT to calculate the number
of groups. The sensitivity of K-COUNT is equal to the
maximum number of distinct keys that a mapper can
output after processing any input record.


5.6 Enforcing δ


Privacy guarantees associated with the THRESHOLD reducer
may have non-zero δ. Intuitively, δ bounds the
probability that the values generated from a given record
will exceed the threshold and appear in the final output.
Assuming that the mapper outputs at most n keys after
processing a single record and the threshold value is C,

$$\delta \le \frac{n}{2} \exp\left(\epsilon \left(1 - \frac{C}{M_{max}}\right)\right)$$

The proof is omitted because of space constraints and appears
in a technical report [46]. When the computation
provider uses the THRESHOLD reducer, Airavat first calculates
the value of δ. If it is less than the bound specified
by the data provider, then computation can proceed; otherwise
it is aborted.
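
The admission check described above can be sketched numerically. The sketch assumes the bound has the form δ ≤ (n/2)·exp(ε(1 − C/M_max)), where M_max is the declared mapper maximum; this is our reading of the formula (it matches a union bound over n values, each perturbed by Laplace noise of scale M_max/ε), and the function names are hypothetical.

```python
import math


def threshold_delta_bound(n: int, epsilon: float,
                          threshold: float, m_max: float) -> float:
    """Upper bound on delta, assumed to be (n/2) * exp(eps * (1 - C / M_max))."""
    return (n / 2.0) * math.exp(epsilon * (1.0 - threshold / m_max))


def may_proceed(n: int, epsilon: float, threshold: float,
                m_max: float, provider_delta: float) -> bool:
    """Run the THRESHOLD computation only if the bound is below the provider's delta."""
    return threshold_delta_bound(n, epsilon, threshold, m_max) < provider_delta


at_range_edge = threshold_delta_bound(1, 1.0, threshold=1.0, m_max=1.0)
far_above = threshold_delta_bound(1, 1.0, threshold=20.0, m_max=1.0)
```

Raising the threshold C well above M_max shrinks the bound exponentially, which is why THRESHOLD is practical only for thresholds that are large relative to any single record's contribution.
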


5.7 Mapper composition 


Multiple mappers {M_1, ..., M_j} can be chained one after
another, followed by a final reducer R_j. Each mapper
after the initial mapper propagates the partition identifier
(pid) and privacy group (gid) values from the input
record to output key/value pairs. Airavat enforces the declared
range for the output of the final mapper M_j. Noise
is added only once by the final reducer R_j.

To reduce the charge to the privacy budget, the compu- 
tation provider can specify the maximum number of keys 
n that any mapper can output after reading records from 
a single privacy group. If provided, Airavat will enforce 
that maximum. If n is not provided, Airavat sets n equal 
to N. If a mapper generates more than n key/value pairs, 
Airavat will only pass n randomly selected pairs to the 
next mapper. 

When charging the total cost of a composed computation
to the privacy budget, Airavat uses ε × min(N, p, n^j)
where j is the number of composed mappers, p is the
maximum number of outputs from any partition (§ 5.4),
and N is the total number of output keys. If the computation
provider supplies the optional arguments, then
N > p > n^j results in a more economical use of the
privacy budget.
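
The composed charge can be sketched as below, reading the formula as ε × min(N, p, n^j) with j the number of chained mappers. The defaults follow the text (p = N and n = N when the optional arguments are not supplied); the function and parameter names are ours.

```python
from typing import Optional


def composed_charge(epsilon: float, n_outputs: int,
                    p: Optional[int] = None,
                    n_keys: Optional[int] = None,
                    num_mappers: int = 1) -> float:
    """Budget charge epsilon * min(N, p, n ** j) for j chained mappers."""
    p = n_outputs if p is None else p
    n_keys = n_outputs if n_keys is None else n_keys
    return epsilon * min(n_outputs, p, n_keys ** num_mappers)


# Two chained mappers, each emitting at most 2 keys per privacy group.
economical = composed_charge(0.5, 8, p=5, n_keys=2, num_mappers=2)  # min(8, 5, 4)
default = composed_charge(0.5, 8)                                    # min(8, 8, 8)
```
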


MapReduce composition not supported. Airavat 
supports composition of mappers and composition of re- 
ducers, but not general composition of MapReduce com- 
putations (i.e., reducer followed by another mapper). For 
many reducers, the output of a MapReduce depends on 
the inputs in a complex way that Airavat cannot easily 
represent, making sensitivity calculations difficult. 

In the future, we plan to investigate MapReduce com- 
position for reducers that do not combine information as- 
sociated with different keys (e.g., those corresponding to 
a “select” statement).


5.8 Choosing privacy parameters


Providers of sensitive data must supply privacy parameters
ε and δ, as well as the privacy budget P_B, in order for
their data to be used in an Airavat computation. These pa- 
rameters are part of the differential privacy model. They 
control the tradeoff between accuracy and privacy. It is 
not possible to give a generic recommendation for set- 
ting their values because they are highly dependent on the 
type of the data, the purpose of the computation, privacy 
threats that the data provider is concerned about, etc. 

As ε increases, the amount of noise added to the output
decreases. Therefore, the output becomes more accurate,
but there is a higher chance that it reveals the presence of
a record in the input dataset. In many cases, the accuracy
required determines the minimum ε-privacy that can be
guaranteed. For example, in Section 8.5 we classify documents
in a privacy-preserving fashion. Our experiments
show that to achieve 95% accuracy in classification, we
need to set ε greater than 0.6.

Intuitively, δ bounds the probability of producing an
output which can occur only as a result of a particular
input (see Section 4). Clearly, such an output immediately
reveals the presence of the input in question. In
many computations (for example, statistical computations
where each input datapoint is a single number), δ
should be set to 0. In our AOL experiment (§ 8.2), which
outputs the search queries that occur more than a threshold
number of times, δ is set to a value close to the number
of unique users. This value of δ bounds the probability
that a single user’s privacy is breached due to the
release of his search query.

The privacy budget (P_B) is finite. If data providers
specify a single privacy budget for all computation 
providers, then one provider can exhaust more than its 
fair share. Data providers could specify privacy budgets 
for each computation provider to ensure fairness. Man- 
aging privacy budgets is an administrative issue inherent 
to all differential privacy mechanisms and orthogonal to 
the design of Airavat. 


5.9 Computing with trusted mappers 


While basic differential privacy only applies to computa- 
tions that produce numeric outputs, it can be generalized 
to discrete domains (e.g., discrete categories or strings) 
using the exponential mechanism of McSherry and Tal- 
war [40]. In general, this requires both mappers and re- 
ducers to be trusted, because keys are an essential part 
of the system’s output. Our prototype includes an im- 
plementation of this mechanism for simple cases, but we 
omit the definition and discussion for brevity. 

As a case study, one of the authors of this paper ported
CloudBurst, a genome mapping algorithm written for
MapReduce [47], to Airavat. The CloudBurst code contains
two mappers and two reducers (3,500 lines total,
including library routines). The mappers are not independent
and the reducers perform non-trivial computation.

The entire system was ported in a week. If a reducer 
was non-trivial, it was replaced by an identity reducer 
and its functionality was executed as the second mapper 
stage. This transformation was largely syntactic. Some 
work was required to make the mappers independent, and 
about 50 lines of code had to be added for differential pri- 
vacy enforcement. 


6 Enforcing mandatory access control 


This section describes how Airavat confines MapRe- 
duce computations, preventing information leaks via sys- 
tem resources by using mandatory access control mecha- 
nisms. Airavat uses SELinux to execute untrusted code in 
a sandbox-like environment and to ensure that local and 
HDFS files are safeguarded from malicious users. While 
decentralized information flow control (DIFC) [45, 49, 
51] would provide far greater flexibility for access con- 
trol policies within Airavat, only prototype DIFC oper- 
ating systems exist. By contrast, SELinux is a broadly 
deployed, mature system. 


6.1 SELinux policy 


Airavat’s SELinux policy creates two domains, one 
trusted and the other untrusted. The trusted components 
of Airavat, such as the MapReduce framework and DFS, 
execute inside the trusted domain. These processes can 
read and write trusted files and connect to the network. 
Untrusted components, such as the user-provided map- 
per, execute in the untrusted domain and have very lim- 
ited permissions. 

Table 2 shows the different domains and how they 
are used. The airavatT_t type is a trusted domain 
used by the MapReduce framework and the distributed 
file system. Airavat labels executables that launch the 
framework and file system with the airavatT_exec_t 
type so the process executes in the trusted domain. This 
trusted domain reads and writes only trusted files (la- 
beled with airavatT_rw_t). No other domain is al- 
lowed to read or write these files. For example, the dis- 
tributed file system stores blocks of data in the underlying 
file system and labels files containing those blocks with 
airavatT_rw_t.

In certain cases Airavat requires the trusted domain 
to create configuration files that can later be read by 
untrusted processes for initialization. Airavat uses the 
airavatT_notsec_t domain to label configuration 
files which do not contain any secrets but whose integrity 
is guaranteed. Since MapReduce requires network com- 
munication for transferring data, our policy allows net- 
work access by the trusted domain. 

Only privileged users may enter the trusted domain.
To implement this restriction, Airavat creates
a trusted SELinux user called airavat_user.
Only airavat_user can execute files labeled with
airavatT_exec_t and transition to the trusted domain.
The system administrator maps a Linux user to
airavat_user.


Domain            | Object labeled | Remark
------------------|----------------|-------
airavatT_t        | Process        | Trusted domain. Can access airavatT_*_t and common domains like sockets, networking, etc.
airavatT_notsec_t | File           | Used to protect configuration files that contain no secrets. Can be read by untrusted code for initialization.
airavatU_t        | Process        | Untrusted domain. Can access only airavatT_notsec_t and can read and write to pipes of the airavatT_t domain.
airavatU_exec_t   | Executable     | Used to transition to the airavatU_t domain.
airavat_user      | User type      | Trusted user who can transition to the airavatT_t domain.

Table 2: SELinux domains defined in Airavat and their usage.

The untrusted domain, airavatU_t, has very few 
privileges. A process in the untrusted domain can- 
not connect to the network, nor read or write files. 
There are two exceptions to this rule. First, the un- 
trusted domain can read configuration files of the type 
airavatT_notsec_t. Second, it can communicate 
with the trusted domain using pipes. All communica- 
tion with the mapper happens via these pipes which are 
established by the trusted framework. A process can en- 
ter the untrusted domain by executing a file of the type 
airavatU_exec_t. In our implementation, the frame- 
work transitions to the untrusted domain by executing the 
JVM that runs the mapper code. 

Each data provider labels its input files (DB) with 
a domain specific to that provider. Only the trusted 
airavatT_t domain can read files from all providers. 
The output of a computation is stored in a file la- 
beled with the trusted domain airavatT_rw_t. Data 
providers may set their declassify flag if they agree to 
declassify the result when Airavat guarantees differen- 
tial privacy. If all data providers agree to declassify, then 
the trusted domain label is removed from the result when 
differential privacy holds. If only a subset of the data 
providers agree to declassify, then the result is labeled by 
a new domain, restricted to entities that have permission 
from all providers who chose to retain their label. Since 
creating domains in SELinux is a cumbersome process, 
our current prototype only supports full declassification. 
DIFC makes this ad hoc sharing among domains easy. 


6.2 Timing channels 


A malicious mapper may leak data using timing chan- 
nels. MapReduce is a batch-oriented programming style 
where most programs do not rely on time. The bandwidth 
of covert timing channels is reduced by making clocks 
noisy and low-resolution [25]. Airavat currently denies 
untrusted mappers access to the high-resolution proces- 
sor cycle counter (TSC), which is accessed via Java APIs. 
A recent timing attack requires the high-definition pro- 
cessor counter to create a channel with 0.2 bits per sec- 
ond capacity [44]. Without the TSC, the data rate drops 
three orders of magnitude. 

We are working to eliminate all obvious time-based 
APIs from the Airavat JVM for untrusted mappers, 
including System.currentTimeMillis. We as- 
sume an environment like Amazon’s elastic MapReduce, 
where the only interface to the system is the MapRe- 
duce programming interface and untrusted mappers are 
the only untrusted code on the system. Untrusted map- 
pers cannot create files, so they cannot use file metadata 
to measure time. Airavat eliminates the API through 
which programs are notified about garbage collection 
(GC), so untrusted code has only indirect evidence about 
GC through the execution of finalizers, weak, soft, and 
phantom references (no Java native interface calls are al- 
lowed). Channels related to GC are inherently noisy and 
are controlled by trusted software whose implementation 
can be changed if it is found to leak too much timing in- 
formation. 

Airavat does not block timing channels caused by 
infinite loops (non-termination). Such channels have 
low bandwidth, leaking one bit per execution. Cloud 
providers send their users billing information (including 
execution time) which may be exploited as a timing chan- 
nel. Quantizing billing units (e.g., billing in multiples of 
$10) and aggregating billing over long time periods (e.g., 
monthly) greatly reduce the data rate of this channel. A 
computer system cannot completely close all time-based 
channels, but a batch-oriented system like MapReduce 
where mappers may not access the network can decrease 
the utility of timing channels for the attacker to a point 
where another attack vector would appear preferable. 


7 Implementation 


The Airavat implementation includes modifications to the 
Hadoop MapReduce framework and Hadoop file system 
(HDFS), a custom JVM for running user-supplied map- 
pers, trusted reducers, and an SELinux policy file. In 
our prototype, we modified 2,000 lines of code in the 
MapReduce framework, 3,000 lines in HDFS, and 500 
lines of code in the JVM. The SELinux policy is ap- 
proximately 450 lines that include the type enforcement 
rules and interface declarations. This section describes 
the changes to HDFS, implementation details of the 
range enforcers, and JVM modifications. 


7.1 HDFS modifications 


An HDFS cluster consists of a single NameNode server 
that manages the file system namespace and a number 
of DataNode servers that store file contents. HDFS cur- 
rently supports file and directory permissions that are 
similar to the discretionary access control of the POSIX 
model. Airavat modifies HDFS to support MAC labels, 
by placing them in the file inode structure. Inodes are 
stored in the NameNode server. Any request for a file op- 
eration by a client is validated against the inode label. In 
the DataNodes, Airavat adds the HDFS label of the file to 
the block information structure. 


7.2 Enforcing sensitivity 


As described in Section 5.1, each mapper has an associ- 
ated range enforcer. The range enforcer determines the 
group for each input record and tags the output produced 
by the mapper with the gid. In the degenerate case when 
each input belongs to a group of its own, each output by 
the mapper is given a unique identifier as its gid. The 
range enforcer also determines and tags the outputs with 
the partition identifier, pid. The default is to tag each 
record as belonging to the same partition. 

During the reduce phase, each reducer fetches the 
sorted key/value pairs produced by the mappers. The re- 
ducer then uses the gid tag to group together the output 
values. Any value that falls outside the range declared by 
the computation provider (M_min ... M_max) is replaced 
by a value inside the range. Such a substitution (if it 
happens) prioritizes privacy over accuracy (§ 5.1). The 
reducer also enforces that only key/value pairs with the 
correct pid are combined to generate the final output. 
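The reducer-side checks described above can be illustrated with a minimal sketch. It is not Airavat's code: the class and method names are hypothetical, and clamping an out-of-range value to the nearest bound is an assumption (the text only requires that the substitute lie inside the declared range).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of reducer-side range enforcement; not Airavat's implementation. */
class RangeEnforcerSketch {
    /** One mapper output value, tagged with its group id (gid) and partition id (pid). */
    static final class Tagged {
        final long gid;
        final int pid;
        final double value;

        Tagged(long gid, int pid, double value) {
            this.gid = gid;
            this.pid = pid;
            this.value = value;
        }
    }

    /** Assumption: an out-of-range value is replaced by the nearest bound of [mMin, mMax]. */
    static double clamp(double value, double mMin, double mMax) {
        return Math.max(mMin, Math.min(mMax, value));
    }

    /** Groups values by gid, keeping only records whose pid matches, and clamps each value. */
    static Map<Long, List<Double>> groupByGid(List<Tagged> outputs, int expectedPid,
                                              double mMin, double mMax) {
        Map<Long, List<Double>> groups = new TreeMap<>();
        for (Tagged t : outputs) {
            if (t.pid != expectedPid) {
                continue; // records from another partition are never combined
            }
            groups.computeIfAbsent(t.gid, g -> new ArrayList<>())
                  .add(clamp(t.value, mMin, mMax));
        }
        return groups;
    }
}
```

Clamping trades accuracy for privacy in the same way the text describes: a mapper that emits a value outside its declared range cannot increase the sensitivity of the computation.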


7.3 Ensuring mapper independence 


To add the proper amount of noise to ensure differential 
privacy, the result of the mapper on each input record 
must not depend on any other input record (§ 5.3). A 
mapper is stateful if it writes a value to storage during an 
invocation and then uses this value in a later invocation. 
Airavat ensures that mapper invocations are not stateful 
by executing them in an untrusted domain that cannot 
write to files or the network. The MAC OS enforces the 
limitation that mappers cannot write to system resources. 

For memory objects, Airavat adds access checks to two 
types of data: objects, which reside on the heap, and stat- 
ics, which reside in the global pool. Airavat modifies the 
Java virtual machine to enforce these checks. Our proto- 
type uses Jikes RVM 3.0.0,* a Java-in-Java research vir- 
tual machine. 


*www.jikesrvm.org 
Airavat prevents mappers from writing static variables. 
This restriction is enforced dynamically by using write 
barriers that are inserted whenever a static is accessed. 
Airavat modifies the object allocator to add a word to 
each object header. This word points to a 64-bit number 
called the invocation number (ivn). The Aira- 
vat JVM inserts read and write barriers for all objects. 
Before each write, the ivn of the object is updated to the 
current invocation number (which is maintained by the 
trusted framework). Before a read, the JVM checks if the 
object’s ivn is less than the current invocation number. 
If so, then the mapper is assumed to be stateful and the 
JVM throws an exception. After this exception, the cur- 
rent map invocation is re-executed and the final output of 
the MapReduce operation is not differentially private and 
must be protected using MAC (without declassification). 
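The invocation-number check can be sketched in ordinary Java. This is only an illustration of the read and write barrier logic described above, not the Jikes RVM modification itself: the class names are hypothetical, the barriers are explicit method calls rather than compiler-inserted code, and stamping a fresh object with the current invocation number at allocation is an assumption.

```java
/** Hypothetical sketch of the invocation-number (ivn) barriers; not the Airavat JVM code. */
class IvnSketch {
    /** Current invocation number; in Airavat this is maintained by the trusted framework. */
    static long currentInvocation = 0;

    /** Thrown when a mapper reads state written during an earlier invocation. */
    static final class StatefulMapperException extends RuntimeException {
        StatefulMapperException(String message) {
            super(message);
        }
    }

    /** Stands in for a heap object whose header carries an ivn word. */
    static final class TrackedCell {
        private long ivn;
        private Object value;

        TrackedCell() {
            this.ivn = currentInvocation; // assumption: stamped at allocation time
        }

        /** Write barrier: stamp the object with the current invocation number. */
        void write(Object newValue) {
            ivn = currentInvocation;
            value = newValue;
        }

        /** Read barrier: reject reads of objects last written in an earlier invocation. */
        Object read() {
            if (ivn < currentInvocation) {
                throw new StatefulMapperException("object written in invocation " + ivn
                        + " read in invocation " + currentInvocation);
            }
            return value;
        }
    }

    /** Called (hypothetically) by the framework before each map invocation. */
    static void beginInvocation() {
        currentInvocation++;
    }
}
```

An object written and read within one invocation passes the check, while an object carried over from a previous invocation triggers the exception unless the mapper overwrites it first.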

Jikes RVM is not mature enough to run code as large 
and complex as the Hadoop framework. We therefore use 
Hadoop’s streaming feature to ensure that mappers run on 
Jikes and that most of the framework executes on Sun’s 
JVM. The streaming utility forks a trusted Jikes process 
that loads the mapper using reflection. The Jikes process 
then executes the map function for each input provided by 
the streaming utility. The streaming utility communicates 
with the Jikes process using pipes. This communication 
is secured by SELinux. 


8 Evaluation 


This section empirically makes the case that Airavat can 
be used to efficiently compute a wide variety of algo- 
rithms in a privacy-preserving manner with acceptable 
accuracy loss. Table 3 provides an overview of the 
case studies. Our experiments show that computations 
in Airavat incur approximately 32% overhead compared 
to those running on unmodified Hadoop and Linux. In 
all experiments except the one with the AOL queries, 
the mappers are untrusted. The AOL experiment outputs 
keys, so we trust the mapper not to encode information in 
the key. 


8.1 Airavat overheads 


We ran all experiments on Amazon’s EC2 service on a 
cluster of 100 machines. We use the large EC2 instances, 
each with two cores of 1.0-1.2 GHz Opteron or Xeon, 7.5 
GB memory, 850 GB hard disk, and running SELinux- 
enabled Fedora 8. The numbers reported are the average 
of 5 runs, and the variance is less than 8%. K-Means 
and Naive Bayes use the public implementations from 
Apache Mahout.* 

Figure 4 breaks down the execution time for each 
benchmark. The values are normalized to the execution 
time of the applications running on unmodified Hadoop 
and unmodified Linux. The graph depicts the percentage 
of the total time spent in different phases, such as map, 
sort, and reduce. The category Copy represents the phase 
where the output data from the mappers is copied by the 
reducer. Note that the copy phase generally overlaps with 
the map phase. The benchmarks show that Airavat slows 
down the computation by less than 33%. 


*http://lucene.apache.org/mahout/ 


Benchmark | Privacy grouping | Reducer primitive | #MapReduce computations | Accuracy metric 
AOL queries | Users | THRESHOLD, SUM | Multiple | % Queries released 
kNN recommender | Individual rating | | Multiple | RMSE 
k-Means | Individual points | | Multiple, till convergence | Intra-cluster variance 
Naive Bayes | Individual articles | | Multiple | Misclassification rate 


Table 3: Details of the benchmarks, including the grouping of data, type of reducer used, number of MapReduce phases, and the 
accuracy metric. 


Figure 4: Normalized execution time of benchmarks when 
running on Airavat, compared to execution on Hadoop. Lower 
is better. 


Benchmark | JVM overhead | Total overhead | Time (sec) 
 | 36.3% | 23.9% | 228 ± 3 
 | 43.2% | 19.6% | 1080 ± 6 
 | 28.5% | 29.4% | 154 ± 7 
 | 37.4% | 32.3% | 94 ± 2 


Table 4: Performance details. 

Table 4 measures the performance overhead of enforc- 
ing differential privacy. The JVM instrumentation, to en- 
sure mapper independence, adds up to 44% overhead in 
the map phase. 


8.2 Queries on AOL dataset 


Recently, Korolova et al. showed how to release search 
queries while preserving privacy [28]. They first find the 
frequency of each query and then output the noisy count 
of those that exceed a certain threshold. Intuitively, the 
threshold suppresses uncommon, low-frequency queries, 
since such queries are likely to breach privacy. 

We demonstrate how Airavat can perform similar com- 
putations on the AOL dataset, while ensuring differential 
privacy. Airavat does not output non-numeric values if 
the mapper is untrusted because non-numeric values can 
leak information (§ 5). The outputs of this experiment 
are search queries (which are non-numeric) and their fre- 
quencies, so we assume that the mapper is trusted. We 
use SUM and THRESHOLD as reducers to generate the 
frequency of distinct queries and then output those that 
exceed the threshold. The privacy group is the user, and 
M is the maximum number of search queries made by 
any single user. The mapper range is (0, M). We vary M 
in our experiments. 
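A minimal sketch of a SUM followed by a THRESHOLD release is given here for illustration. It is not Airavat's reducer code: the class and parameter names are hypothetical, noise is added before the threshold comparison as an assumption, the noise scale and threshold are passed in by the caller rather than derived from ε, δ, and M, and the per-user cap of M queries is assumed to have been enforced upstream.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical sketch of a SUM + THRESHOLD release; not Airavat's actual reducers. */
class ThresholdReleaseSketch {
    /** Draws Laplace(0, scale) noise by inverting the CDF of a uniform sample. */
    static double laplace(double scale, Random rng) {
        if (scale == 0.0) {
            return 0.0; // no noise requested; handy for deterministic checks
        }
        double u = rng.nextDouble() - 0.5; // uniform in [-0.5, 0.5)
        return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
    }

    /**
     * Counts each distinct query (SUM), perturbs the count, and releases only
     * those queries whose noisy count exceeds the threshold (THRESHOLD).
     */
    static Map<String, Double> release(List<String> queries, double threshold,
                                       double noiseScale, Random rng) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String q : queries) {
            counts.merge(q, 1, Integer::sum);
        }
        Map<String, Double> released = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double noisy = e.getValue() + laplace(noiseScale, rng);
            if (noisy > threshold) {
                released.put(e.getKey(), noisy);
            }
        }
        return released;
    }
}
```

Raising the threshold suppresses more of the rare queries, which matches the intuition that low-frequency queries are the ones most likely to identify a user.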

Our experiments use the AOL data for the first week 
of April 2006 (253K queries). Since we use the thresh- 
old function, Airavat needs a non-zero δ as input. We 
chose δ = 10^-5 based on the number of unique users for 
this week, 24,861. Fixing the value of ε and δ also de- 
termines the minimum threshold to ensure privacy. The 
exact threshold value can be calculated from the formula 
in section 5.5: C = M(1 — mar) y, 

It is possible that a single user may perform an un- 
common search multiple times (e.g., if he searches for 
his name or address). Releasing such search queries can 
compromise the user’s privacy. The probability of such a 
release can be reduced by increasing M and/or setting a 
low value of δ. A large value of M implies that the re- 
lease threshold C is also large, thus reducing the chance 
that an uncommon query will be released. 

In our experiments, we show the effect of different 
parameters on the number of queries that get published. 
First, we vary M, the maximum number of search queries 
that belong to any one user. Figure 5(a) shows that as 
we increase the value of M, the threshold value also in- 
creases, resulting in a smaller number of distinct queries 
being released. Second, we vary the privacy parameter ε. 
As we increase ε, i.e., decrease the privacy restrictions, 
more queries can be released. Note that fewer than 1% 
of total unique queries (109K) are released. The reason 
is that most queries are issued very few times and hence 
cannot be released without jeopardizing the privacy of 
users who issued them. 


8.3 Covariance matrices 


Covariance matrices find use in many machine-learning 
computations. For example, McSherry and Mironov re- 
cently showed how to build a recommender system that 
preserves individual privacy [39]. The main idea is to 
construct a covariance matrix in a privacy-preserving 
fashion and then use a recommender algorithm such as 
k-nearest neighbor (kNN) on the matrix. 


Figure 5: Effect of privacy parameter on the (a) number of released AOL search queries, (b) accuracy in RMSE in kNN recom- 
mender system (lower is better) and (c) accuracy in k-Means and Naive Bayes (higher is better). 

We picked 1,000 movies from the Netflix prize dataset 
and generated a covariance matrix using Airavat. The 
computation protects the privacy of any individual Net- 
flix user. We cannot calculate the complete matrix in one 
computation using the Airavat primitives. Instead, we fill 
the matrix one cell at a time. The disadvantage of 
this approach is that the privacy budget is expended very 
quickly. For example, if the matrix has M^2 cells, then we 
subtract εM^2 from the privacy budget (equivalently, we 
achieve εM^2-differential privacy). 

Because each movie rating is between 1 and 5 and an 
entry of the covariance matrix is a product of two such 
ratings, the mapper range is (0,25). Figure 5(b) plots 
the root mean squared error (RMSE) of the kNN algo- 
rithm when executed on the covariance matrix generated 
by Airavat. The x-axis corresponds to the privacy guar- 
antee for the complete covariance matrix. Our results 
show that with the guarantee of 5-differential privacy, the 
RMSE of kNN is approximately 0.97. For comparison, 
Netflix’s own algorithm, called Cinematch, has an RMSE 
of 0.95 when applied on the complete Netflix dataset. 


8.4 Clustering Algorithm: k-Means 


The k-Means algorithm clusters input vectors into k par- 
titions. The partitioning aims to minimize intra-cluster 
variances. We use Lloyd’s iterative heuristic to com- 
pute k-Means. The algorithm proceeds in two steps [6]. 
In the first step, the cardinality of each cluster is calcu- 
lated. In the second step, all points in the new cluster are 
added up and then divided by the cardinality derived in 
the previous step, producing new cluster centers. The in- 
put dataset consists of 600 examples of control charts.* 
Control charts are used to assess whether a process is 
functioning properly. Machine learning techniques are 
often applied to such charts to detect anomaly patterns. 
Figure 5(c) plots the accuracy of the k-Means algo- 
rithm as we change the privacy parameter ε. We assume 
that each point belongs to a different user whose privacy 
must be guaranteed. The mapper range of the computa- 
tion that calculates the cluster size is (0, 1). The mapper 
range for calculating the actual cluster centers is bounded 
by the maximum value of any coordinate over all points, 
which is 36 for the current dataset. We measure the ac- 
curacy of the algorithm by computing the intra-cluster 
variance. With ε > 0.5, the accuracy of the clustering 
algorithm exceeds 90%. 


*http://archive.ics.uci.edu/ml/databases/synthetic_control 
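The two-step iteration used in this section (count the points in each cluster, then sum them and divide by that count) can be sketched as follows. This is a plain, non-private illustration: the class name is hypothetical, no noise is added, and keeping the old center for an empty cluster is an assumption. In Airavat the count and the sum would each be a separate noisy MapReduce computation.

```java
/** Hypothetical sketch of one Lloyd iteration for k-Means; no differential-privacy noise is added. */
class KMeansStepSketch {
    /** Returns the index of the center closest to point p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** One iteration: step 1 counts cluster cardinalities, step 2 sums points and divides. */
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length;
        int d = centers[0].length;
        int[] count = new int[k];
        double[][] sum = new double[k][d];
        for (double[] p : points) {
            int c = nearest(p, centers);
            count[c]++;
            for (int j = 0; j < d; j++) {
                sum[c][j] += p[j];
            }
        }
        double[][] next = new double[k][d];
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < d; j++) {
                // Assumption: an empty cluster keeps its previous center.
                next[c][j] = (count[c] == 0) ? centers[c][j] : sum[c][j] / count[c];
            }
        }
        return next;
    }
}
```

The mapper ranges quoted in the text correspond to these two steps: each point contributes a value in (0, 1) to a cardinality and at most the largest coordinate value to a sum.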


8.5 Classification algorithm: Naive Bayes 


Naive Bayes is a simple probabilistic classifier that ap- 
plies the Bayes Theorem with assumptions of strong in- 
dependence. During the training phase, the algorithm is 
given a set of feature vectors and the class labels to which 
they belong. The algorithm creates a model, which is 
then used in the classification phase to classify previously 
unseen vectors. 

Figure 5(c) plots the accuracy against the privacy pa- 
rameter ε. We used the 20newsgroup dataset,* which 
consists of different articles represented by words that 
appear in them. We train the classifier on one partition 
of the dataset and test it on another. The value of ε af- 
fects the noise which is added to the model in the train- 
ing phase. We measure the accuracy of the classifier by 
looking at the number of misclassified articles. An arti- 
cle contributes at most 1,000 to a category of words, so 
the range for mapper outputs is (0,1000). Our results 
show that, for this particular dataset, we require ε > 0.6 
to achieve 957% accuracy. 


9 Related work 


Differential privacy guarantees are somewhat similar to 
robust or secure statistical estimation, which provides sta- 
tistical computations with low sensitivity to any single 
input (e.g., see [21, 23, 24]). While robust estimators do 
not by themselves guarantee privacy, they can serve as 
the basis for differentially private estimators [16]. 


*http://people.csail.mit.edu/jrennie/20Newsgroups/ 


In its current version, Airavat requires computation 
providers to provide an upper bound on the sensitivity 
of their code by declaring the range of its possible out- 
puts in advance. An alternative is to have the enforce- 
ment system estimate local, input-specific sensitivity of 
the function computed by the code—either by re-running 
it on perturbed inputs, or by sampling from the input 
space [43]. Local sensitivity measures how much the out- 
put of the function varies on neighboring inputs from a 
subset of the function’s domain. It often requires less 
noise to be added to the output in order to achieve the 
same differential privacy guarantee. We plan to investi- 
gate this approach in future work. 


PINQ. Privacy Integrated Queries (PINQ) is a declar- 
ative system for computing on sensitive data [38] which 
ensures differential privacy for the outputs of the com- 
putation. Airavat mappers are Java bytecode, with re- 
strictions on the programming model enforced at run- 
time. Mapper independence is an example of a restriction 
enforced by the language runtime which is absent from 
PINQ. PINQ provides a restricted programming language 
with a small number of trusted, primitive data operations 
in the LINQ framework. PINQ employs a request/reply 
model, which avoids adding noise to the intermediate re- 
sults of the computation by keeping them on a trusted 
data server or an abstraction of a trusted data server pro- 
vided by a distributed system. 


Airavat’s privacy enforcement mechanisms provide 
end-to-end guarantees, while PINQ provides language- 
level guarantees. Airavat’s enforcement mechanisms in- 
clude all software in the MapReduce framework, includ- 
ing language runtimes, the distributed file system, and the 
operating system. Enforcing privacy throughout the soft- 
ware stack allows Airavat computations to be securely 
distributed across multiple nodes, achieving the scalabil- 
ity that is the hallmark of the MapReduce framework. 
While the PINQ API can be supported in a similar set- 
ting (e.g., DryadLINQ), PINQ’s security would then de- 
pend on the security of Microsoft’s common language 
runtime (CLR), the Cosmos distributed file system, the 
Dryad framework, and the operating system. Securing 
the levels below the language layer would require the 
same security guarantees as provided by Airavat. 


Alternative definitions of privacy. Differential pri- 
vacy is a relative notion: it assures the owner of any 
individual data item that the same privacy violations, if 
any, will occur whether this item is included in the aggre- 
gate computation or not. Therefore, no additional privacy 
risk arises from participating in the computation. While 
this may seem like a relatively weak guarantee, stronger 
properties cannot be achieved without making unjustified 
assumptions about the adversary [11, 12]. Superficially 
plausible but unachievable definitions include “the adver- 
sary does not learn anything about the data that he did not 
know before” [8] and “the adversary’s posterior distribu- 
tion of possible data values after observing the result of 
the computation is close to his prior distribution.” 


Secure multi-party computation [20] ensures that a dis- 
tributed protocol leaks no more information about the in- 
puts than is revealed by the output of the computation. 
The goal is to keep the intermediate steps of the com- 
putation secret. This technique is not appropriate in our 
setting, where the goal is to ensure that the output itself 
does not leak too much information about the inputs. 


While differential privacy mechanisms often employ 
output perturbation (adding random noise to the result of 
a computation), several approaches to privacy-preserving 
data mining add random noise to inputs instead. Pri- 
vacy guarantees are usually average-case and do not im- 
ply anything about the privacy of individual inputs. For 
example, the algorithm of Agrawal and Srikant [4] fails 
to hide individual inputs [3]. In turn, Evfimievski et al. 
show that the definitions of [3] are too weak to provide 
individual privacy [18]. 


k-anonymity focuses on non-interactive releases of re- 
lational data and requires that every record in the released 
dataset be syntactically indistinguishable from at least 
k - 1 other records on the so-called quasi-identifying 
attributes, such as ZIP code and date of birth [7, 48]. 
k-anonymity is achieved by syntactic generalization and 
suppression of these attributes (e.g., [31]). k-anonymity 
does not provide meaningful privacy guarantees. It fun- 
damentally assumes that the adversary’s knowledge is 
limited to the quasi-identifying attributes and thus fails 
to provide any protection against adversaries who have 
additional information [34, 35]. It does not hide whether 
a particular individual is in the dataset [42], nor the sen- 
sitive attributes associated with any individual [32, 34]. 
Multiple releases of the same dataset or mere knowledge 
of the k-anonymization algorithm may completely break 
the protection [19, 52]. Variants, such as l-diversity [34] 
and m-invariance [50], suffer from many of the same 
flaws. 


Program analysis techniques can be used to estimate 
how much information is leaked by a program [36]. Pri- 
vacy in MapReduce computations, however, is difficult 
if not impossible to express as a quantitative information 
flow problem. The flow bound cannot be set at 0 bits be- 
cause the output depends on every single input. But even 
a 1-bit leakage may be sufficient to reveal, for example, 
whether a given person’s record was present in the input 
dataset or not, violating privacy. By contrast, differential 
privacy guarantees that the information revealed by the 
computation cannot be specific to any given input. 


10 Conclusion 


Airavat is the first system that integrates mandatory ac- 
cess control with differential privacy, enabling many 
privacy-preserving MapReduce computations without 
the need to audit untrusted code. We demonstrate the 
practicality of Airavat by evaluating it on a variety of case 
studies. 


Abstract 


MapReduce is a popular framework for data-intensive 
distributed computing of batch jobs. To simplify fault 
tolerance, many implementations of MapReduce mate- 
rialize the entire output of each map and reduce task 
before it can be consumed. In this paper, we propose a 
modified MapReduce architecture that allows data to be 
pipelined between operators. This extends the MapRe- 
duce programming model beyond batch processing, and 
can reduce completion times and improve system utiliza- 
tion for batch jobs as well. We present a modified version 
of the Hadoop MapReduce framework that supports on- 
line aggregation, which allows users to see “early returns” 
from a job as it is being computed. Our Hadoop Online 
Prototype (HOP) also supports continuous queries, which 
enable MapReduce programs to be written for applica- 
tions such as event monitoring and stream processing. 
HOP retains the fault tolerance properties of Hadoop and 
can run unmodified user-defined MapReduce programs. 


1 Introduction 


MapReduce has emerged as a popular way to harness 
the power of large clusters of computers. MapReduce 
allows programmers to think in a data-centric fashion: 
they focus on applying transformations to sets of data 
records, and allow the details of distributed execution, 
network communication and fault tolerance to be handled 
by the MapReduce framework. 

MapReduce is typically applied to large batch-oriented 
computations that are concerned primarily with time to 
job completion. The Google MapReduce framework [6] 
and open-source Hadoop system reinforce this usage 
model through a batch-processing implementation strat- 
egy: the entire output of each map and reduce task is 
materialized to a local file before it can be consumed 
by the next stage. Materialization allows for a simple 
and elegant checkpoint/restart fault tolerance mechanism 
that is critical in large deployments, which have a high 
probability of slowdowns or failures at worker nodes. 

We propose a modified MapReduce architecture in 
which intermediate data is pipelined between operators, 
while preserving the programming interfaces and fault 
tolerance models of previous MapReduce frameworks. 
To validate this design, we developed the Hadoop Online 
Prototype (HOP), a pipelining version of Hadoop.¹ 

Pipelining provides several important advantages to a 
MapReduce framework, but also raises new design chal- 
lenges. We highlight the potential benefits first: 


• Since reducers begin processing data as soon as it is 
  produced by mappers, they can generate and refine 
  an approximation of their final answer during the 
  course of execution. This technique, known as on- 
  line aggregation [12], can provide initial estimates 
  of results several orders of magnitude faster than the 
  final results. We describe how we adapted online ag- 
  gregation to our pipelined MapReduce architecture 
  in Section 4. 


• Pipelining widens the domain of problems to which 
  MapReduce can be applied. In Section 5, we show 
  how HOP can be used to support continuous queries: 
  MapReduce jobs that run continuously, accepting 
  new data as it arrives and analyzing it immediately. 
  This allows MapReduce to be used for applications 
  such as event monitoring and stream processing. 


• Pipelining delivers data to downstream operators 
  more promptly, which can increase opportunities for 
  parallelism, improve utilization, and reduce response 
  time. A thorough performance study is a topic for 
  future work; however, in Section 6 we present some 
  initial performance results which demonstrate that 
  pipelining can reduce job completion times by up to 
  25% in some scenarios. 


¹The source code for HOP can be downloaded from 
http://code.google.com/p/hop/ 


Pipelining raises several design challenges. First, 
Google’s attractively simple MapReduce fault tolerance 
mechanism is predicated on the materialization of inter- 
mediate state. In Section 3.3, we show that this can co- 
exist with pipelining, by allowing producers to periodi- 
cally ship data to consumers in parallel with their mate- 
rialization. A second challenge arises from the greedy 
communication implicit in pipelines, which is at odds 
with batch-oriented optimizations supported by “combin- 
ers”: map-side code that reduces network utilization by 
performing pre-aggregation before communication. We 
discuss how the HOP design addresses this issue in Sec- 
tion 3.1. Finally, pipelining requires that producers and 
consumers are co-scheduled intelligently; we discuss our 
initial work on this issue in Section 3.4. 


1.1 Structure of the Paper 


In order to ground our discussion, we present an overview 
of the Hadoop MapReduce architecture in Section 2. We 
then develop the design of HOP’s pipelining scheme in 
Section 3, keeping the focus on traditional batch process- 
ing tasks. In Section 4 we show how HOP can support 
online aggregation for long-running jobs and illustrate 
the potential benefits of that interface for MapReduce 
tasks. In Section 5 we describe our support for continu- 
ous MapReduce jobs over data streams and demonstrate 
an example of near-real-time cluster monitoring. We 
present initial performance results in Section 6. Related 
and future work are covered in Sections 7 and 8. 


2 Background 


In this section, we review the MapReduce programming 
model and describe the salient features of Hadoop, a 
popular open-source implementation of MapReduce. 


2.1 Programming Model 


To use MapReduce, the programmer expresses their de- 
sired computation as a series of jobs. The input to a job 
is an input specification that will yield key-value pairs. 
Each job consists of two stages: first, a user-defined map 
function is applied to each input record to produce a list 
of intermediate key-value pairs. Second, a user-defined 
reduce function is called once for each distinct key in 
the map output and passed the list of intermediate values 
associated with that key. The MapReduce framework au- 
tomatically parallelizes the execution of these functions 
and ensures fault tolerance. 
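To make the two stages concrete, here is a minimal single-process sketch of the programming model, using word count as the job. It is not Hadoop's API: the class and method names are hypothetical, and the sketch omits parallel execution and fault tolerance, which are exactly what a real framework adds.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical single-process sketch of the MapReduce programming model; not Hadoop's API. */
class MiniMapReduceSketch {
    /**
     * Applies map to every input record, groups the intermediate pairs by key,
     * and calls reduce once per distinct key with the list of values for that key.
     */
    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> map,
            BiFunction<K, List<V>, O> reduce) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> kv : map.apply(record)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<K, O> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Word count: map emits (word, 1) for each word, reduce sums the ones. */
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduceSketch.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                        }
                    }
                    return pairs;
                },
                (word, ones) -> {
                    int total = 0;
                    for (int one : ones) {
                        total += one;
                    }
                    return total;
                });
    }
}
```

For example, `MiniMapReduceSketch.wordCount(Arrays.asList("a b a", "b c"))` groups the intermediate pairs under the keys a, b, and c before any reduce call runs, which is the materialization boundary that pipelining later relaxes.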

Optionally, the user can supply a combiner function [6]. 
Combiners are similar to reduce functions, except that 
they are not passed all the values for a given key: instead, 
a combiner emits an output value that summarizes the 
input values it was passed. Combiners are typically used 
to perform map-side “pre-aggregation,” which reduces 
the amount of network traffic required between the map 
and reduce steps. 


public interface Mapper<K1, V1, K2, V2> { 
    void map(K1 key, V1 value, 
             OutputCollector<K2, V2> output); 

    void close(); 
} 


Figure 1: Map function interface. 


2.2 Hadoop Architecture 


Hadoop is composed of Hadoop MapReduce, an imple- 
mentation of MapReduce designed for large clusters, and 
the Hadoop Distributed File System (HDFS), a file system 
optimized for batch-oriented workloads such as MapRe- 
duce. In most Hadoop jobs, HDFS is used to store both 
the input to the map step and the output of the reduce step. 
Note that HDFS is not used to store intermediate results 
(e.g., the output of the map step): these are kept on each 
node’s local file system. 

A Hadoop installation consists of a single master node 
and many worker nodes. The master, called the Job- 
Tracker, is responsible for accepting jobs from clients, 
dividing those jobs into tasks, and assigning those tasks 
to be executed by worker nodes. Each worker runs a Task- 
Tracker process that manages the execution of the tasks 
currently assigned to that node. Each TaskTracker has a 
fixed number of slots for executing tasks (two maps and 
two reduces by default). 


2.3 Map Task Execution 


Each map task is assigned a portion of the input file called 
a split. By default, a split contains a single HDFS block 
(64MB by default), so the total number of file blocks 
determines the number of map tasks. 

The execution of a map task is divided into two phases. 


1. The map phase reads the task’s split from HDFS, 
parses it into records (key/value pairs), and applies 
the map function to each record. 


2. After the map function has been applied to each 
input record, the commit phase registers the final 
output with the TaskTracker, which then informs the 
JobTracker that the task has finished executing. 


Figure 1 contains the interface that must be imple- 
mented by user-defined map functions. After the map 
function has been applied to each record in the split, the 
close method is invoked. 

Figure 2: Map task index and data file format (2 parti- 
tion/reduce case). 


The third argument to the map method specifies an 
OutputCollector instance, which accumulates the output 
records produced by the map function. The output of the 
map step is consumed by the reduce step, so the Output- 
Collector stores map output in a format that is easy for 
reduce tasks to consume. Intermediate keys are assigned 
to reducers by applying a partitioning function, so the Out- 
putCollector applies that function to each key produced 
by the map function, and stores each record and partition 
number in an in-memory buffer. The OutputCollector 
spills this buffer to disk when it reaches capacity. 

A spill of the in-memory buffer involves first sorting 
the records in the buffer by partition number and then by 
key. The buffer content is written to the local file system 
as an index file and a data file (Figure 2). The index file 
points to the offset of each partition in the data file. The 
data file contains only the records, which are sorted by 
the key within each partition segment. 
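
The spill procedure can be sketched as follows. This is an in-memory approximation under stated assumptions, not Hadoop's implementation: the class SpillSketch, the hash-based partitionOf function, and the use of record positions (rather than byte offsets in a file) for the index are all choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: an in-memory stand-in for a map-side spill.
// Offsets are record positions rather than byte offsets in a file.
final class SpillSketch {

    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical partitioning function: hash of the key modulo the number of reducers.
    static int partitionOf(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Sort a copy of the buffer by partition number, then by key.
    static List<Rec> sortForSpill(List<Rec> buffer) {
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing((Rec r) -> r.key));
        return sorted;
    }

    // Build the "index": index[p] is the position of partition p's first record
    // in the sorted data, and index[numReducers] is the total record count, so
    // partition p occupies positions [index[p], index[p + 1]).
    static int[] buildIndex(List<Rec> sorted, int numReducers) {
        int[] index = new int[numReducers + 1];
        for (Rec r : sorted) {
            index[r.partition + 1]++;          // count records per partition
        }
        for (int p = 0; p < numReducers; p++) {
            index[p + 1] += index[p];          // prefix sums turn counts into offsets
        }
        return index;
    }
}
```

With such an index, serving one reducer's partition requires reading only a single contiguous range of the data.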

During the commit phase, the final output of the map 
task is generated by merging all the spill files produced by 
this task into a single pair of data and index files. These 
files are registered with the TaskTracker before the task 
completes. The TaskTracker will read these files when 
servicing requests from reduce tasks. 


2.4 Reduce Task Execution 


The execution of a reduce task is divided into three phases. 


1. The shuffle phase fetches the reduce task’s input data. 
Each reduce task is assigned a partition of the key 
range produced by the map step, so the reduce task 
must fetch the content of this partition from every 
map task’s output. 


2. The sort phase groups records with the same key 
together. 


3. The reduce phase applies the user-defined reduce 
function to each key and corresponding list of values. 


public interface Reducer<K2, V2, K3, V3> { 
    void reduce(K2 key, Iterator<V2> values, 
                OutputCollector<K3, V3> output); 

    void close(); 
} 

Figure 3: Reduce function interface. 


In the shuffle phase, a reduce task fetches data from 
each map task by issuing HTTP requests to a configurable 
number of TaskTrackers at once (5 by default). The Job- 
Tracker relays the location of every TaskTracker that hosts 
map output to every TaskTracker that is executing a re- 
duce task. Note that a reduce task cannot fetch the output 
of a map task until the map has finished executing and 
committed its final output to disk. 

After receiving its partition from all map outputs, the 
reduce task enters the sort phase. The map output for 
each partition is already sorted by the reduce key. The 
reduce task merges these runs together to produce a sin- 
gle run that is sorted by key. The task then enters the 
reduce phase, in which it invokes the user-defined reduce 
function for each distinct key in sorted order, passing it 
the associated list of values. The output of the reduce 
function is written to a temporary location on HDFS. Af- 
ter the reduce function has been applied to each key in 
the reduce task’s partition, the task’s HDFS output file 
is atomically renamed from its temporary location to its 
final location. 
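
The merge performed in the sort phase is a standard k-way merge of sorted runs. The sketch below shows one common way to do this with a priority queue; it operates on in-memory lists of keys and is an assumption about the general technique, not a description of Hadoop's actual merge code. The names MergeSketch and mergeRuns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: k-way merge of already-sorted runs of keys.
final class MergeSketch {

    // Cursor into one sorted run.
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    static List<String> mergeRuns(List<List<String>> runs) {
        // Min-heap ordered by each run's current head key.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<Cursor>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();            // run with the smallest head key
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c);                   // re-insert if the run is not exhausted
            }
        }
        return merged;
    }
}
```

Equal keys end up adjacent in the merged output, which is what allows the reduce function to be invoked once per distinct key.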

In this design, the output of both map and reduce tasks 
is written to disk before it can be consumed. This is par- 
ticularly expensive for reduce tasks, because their output 
is written to HDFS. Output materialization simplifies 
fault tolerance, because it reduces the amount of state that 
must be restored to consistency after a node failure. If any 
task (either map or reduce) fails, the JobTracker simply 
schedules a new task to perform the same work as the 
failed task. Since a task never exports any data other than 
its final answer, no further recovery steps are needed. 


3 Pipelined MapReduce 


In this section we discuss our extensions to Hadoop to sup- 
port pipelining between tasks (Section 3.1) and between 
jobs (Section 3.2). We describe how our design supports 
fault tolerance (Section 3.3), and discuss the interaction 
between pipelining and task scheduling (Section 3.4). Our 
focus here is on batch-processing workloads; we discuss 
online aggregation and continuous queries in Section 4 
and Section 5. We defer performance results to Section 6. 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 315 


3.1 Pipelining Within A Job 


As described in Section 2.4, reduce tasks traditionally 
issue HTTP requests to pull map output from each Task- 
Tracker. This means that map task execution is com- 
pletely decoupled from reduce task execution. To support 
pipelining, we modified the map task to instead push data 
to reducers as it is produced. To give an intuition for 
how this works, we begin by describing a straightforward 
pipelining design, and then discuss the changes we had to 
make to achieve good performance. 


3.1.1 Naive Pipelining 


In our naive implementation, we modified Hadoop to send 
data directly from map to reduce tasks. When a client 
submits a new job to Hadoop, the JobTracker assigns 
the map and reduce tasks associated with the job to the 
available TaskTracker slots. For purposes of discussion, 
we assume that there are enough free slots to assign all 
the tasks for each job. We modified Hadoop so that each 
reduce task contacts every map task upon initiation of 
the job, and opens a TCP socket which will be used to 
pipeline the output of the map function. As each map 
output record is produced, the mapper determines which 
partition (reduce task) the record should be sent to, and 
immediately sends it via the appropriate socket. 

A reduce task accepts the pipelined data it receives 
from each map task and stores it in an in-memory buffer, 
spilling sorted runs of the buffer to disk as needed. Once 
the reduce task learns that every map task has completed, 
it performs a final merge of all the sorted runs and applies 
the user-defined reduce function as normal. 


3.1.2 Refinements 


While the algorithm described above is straightforward, 
it suffers from several practical problems. First, it is 
possible that there will not be enough slots available to 
schedule every task in a new job. Opening a socket be- 
tween every map and reduce task also requires a large 
number of TCP connections. A simple tweak to the naive 
design solves both problems: if a reduce task has not 
yet been scheduled, any map tasks that produce records 
for that partition simply write them to disk. Once the 
reduce task is assigned a slot, it can then pull the records 
from the map task, as in regular Hadoop. To reduce the 
number of concurrent TCP connections, each reducer can 
be configured to pipeline data from a bounded number 
of mappers at once; the reducer will pull data from the 
remaining map tasks in the traditional Hadoop manner. 

Our initial pipelining implementation suffered from a 
second problem: the map function was invoked by the 
same thread that wrote output records to the pipeline sock- 
ets. This meant that if a network I/O operation blocked 
(e.g., because the reducer was over-utilized), the mapper 
was prevented from doing useful work. Pipeline stalls 
should not prevent a map task from making progress— 
especially since, once a task has completed, it frees a 
TaskTracker slot to be used for other purposes. We solved 
this problem by running the map function in a separate 
thread that stores its output in an in-memory buffer, and 
then having another thread periodically send the contents 
of the buffer to the connected reducers. 
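
The thread separation described above follows the familiar producer/consumer pattern. The sketch below is a simplified illustration under several assumptions: the buffer is an unbounded queue, the "map function" is a placeholder that upper-cases its input, and delivery to a reducer is simulated by appending to a list. The class name and the end-of-output marker are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: the map function and the network sender run in separate
// threads and communicate through an in-memory buffer.
final class DecoupledMapperSketch {

    private static final String END = "\u0000END";   // hypothetical end-of-output marker

    // Runs a "map" thread and a "sender" thread; returns what the sender delivered.
    static List<String> run(List<String> inputRecords) {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        List<String> delivered = new ArrayList<>();

        Thread mapThread = new Thread(() -> {
            for (String rec : inputRecords) {
                buffer.add(rec.toUpperCase());       // stand-in for the user's map function
            }
            buffer.add(END);
        });

        Thread senderThread = new Thread(() -> {
            try {
                while (true) {
                    String out = buffer.take();      // blocks only the sender thread
                    if (out.equals(END)) {
                        break;
                    }
                    delivered.add(out);              // stand-in for a socket write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        mapThread.start();
        senderThread.start();
        try {
            mapThread.join();
            senderThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return delivered;
    }
}
```

Because the queue is unbounded in this sketch, a slow sender never stalls the map thread; a real implementation would need to bound memory use, for example by spilling to disk.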


3.1.3 Granularity of Map Output 


Another problem with the naive design is that it eagerly 
sends each record as soon as it is produced, which pre- 
vents the use of map-side combiners. Imagine a job where 
the reduce key has few distinct values (e.g., gender), and 
the reduce applies an aggregate function (e.g., count). As 
discussed in Section 2.1, combiners allow map-side “pre- 
aggregation”: by applying a reduce-like function to each 
distinct key at the mapper, network traffic can often be 
substantially reduced. Eagerly pipelining each record as 
it is produced prevents the use of map-side combiners. 

A related problem is that eager pipelining moves some 
of the sorting work from the mapper to the reducer. Re- 
call that in the blocking architecture, map tasks generate 
sorted spill files: all the reduce task must do is merge to- 
gether the pre-sorted map output for each partition. In the 
naive pipelining design, map tasks send output records 
in the order in which they are generated, so the reducer 
must perform a full external sort. Because the number of 
map tasks typically far exceeds the number of reduces [6], 
moving more work to the reducer increased response time 
in our experiments. 

We addressed these issues by modifying the in-memory 
buffer design described in Section 3.1.2. Instead of send- 
ing the buffer contents to reducers directly, we wait for 
the buffer to grow to a threshold size. The mapper then 
applies the combiner function, sorts the output by parti- 
tion and reduce key, and writes the buffer to disk using 
the spill file format described in Section 2.3. 

Next, we arranged for the TaskTracker at each node to 
handle pipelining data to reduce tasks. Map tasks register 
spill files with the TaskTracker via RPCs. If the reducers 
are able to keep up with the production of map outputs and 
the network is not a bottleneck, a spill file will be sent to 
a reducer soon after it has been produced (in which case, 
the spill file is likely still resident in the map machine’s 
kernel buffer cache). However, if a reducer begins to fall 
behind, the number of unsent spill files will grow. 

When a map task generates a new spill file, it first 
queries the TaskTracker for the number of unsent spill 
files. If this number grows beyond a certain threshold 
(two unsent spill files in our experiments), the map task 
does not immediately register the new spill file with the 
TaskTracker. Instead, the mapper will accumulate multi- 
ple spill files. Once the queue of unsent spill files falls 
below the threshold, the map task merges and combines 
the accumulated spill files into a single file, and then re- 
sumes registering its output with the TaskTracker. This 
simple flow control mechanism has the effect of adap- 
tively moving load from the reducer to the mapper or vice 
versa, depending on which node is the current bottleneck. 
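
The map-side decision can be sketched as follows. This is a simplified model, not the authors' code: spill files are represented by names, merging is represented by string concatenation, and the exact boundary condition (holding back only when the backlog strictly exceeds the threshold) and the choice to fold the newest spill into the merge are assumptions. All identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the map-side decision of whether to register a new spill
// file for pipelining or to hold it back. All names are hypothetical.
final class SpillFlowControlSketch {

    private final int threshold;                            // e.g. 2 unsent spill files
    private final List<String> held = new ArrayList<>();    // spills not yet registered
    final List<String> registered = new ArrayList<>();      // spills handed to the TaskTracker

    SpillFlowControlSketch(int threshold) {
        this.threshold = threshold;
    }

    // Called for each new spill; unsentCount is the TaskTracker's current backlog.
    void onNewSpill(String spillName, int unsentCount) {
        if (unsentCount > threshold) {
            held.add(spillName);                // reducer is behind: accumulate locally
            return;
        }
        if (held.isEmpty()) {
            registered.add(spillName);          // normal case: register immediately
        } else {
            held.add(spillName);
            // Stand-in for merging and combining the accumulated spill files into one.
            registered.add("merged(" + String.join("+", held) + ")");
            held.clear();
        }
    }
}
```

When the reducer keeps up, every spill is registered as soon as it is produced; when it falls behind, the mapper does extra merge and combine work and ships fewer, larger files.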


A similar mechanism is also used to control how ag- 
gressively the combiner function is applied. The map task 
records the ratio between the input and output data sizes 
whenever it invokes the combiner function. If the com- 
biner is effective at reducing data volumes, the map task 
accumulates more spill files (and applies the combiner 
function to all of them) before registering that output with 
the TaskTracker for pipelining.² 


The connection between pipelining and adaptive query 
processing techniques has been observed elsewhere 
(e.g., [2]). The adaptive scheme outlined above is rel- 
atively simple, but we believe that adapting to feedback 
along pipelines has the potential to significantly improve 
the utilization of MapReduce clusters. 


3.2 Pipelining Between Jobs 


Many practical computations cannot be expressed as a 
single MapReduce job, and the outputs of higher-level 
languages like Pig [20] typically involve multiple jobs. In 
the traditional Hadoop architecture, the output of each job 
is written to HDFS in the reduce step and then immedi- 
ately read back from HDFS by the map step of the next 
job. Furthermore, the JobTracker cannot schedule a con- 
sumer job until the producer job has completed, because 
scheduling a map task requires knowing the HDFS block 
locations of the map’s input split. 


In our modified version of Hadoop, the reduce tasks of 
one job can optionally pipeline their output directly to the 
map tasks of the next job, sidestepping the need for ex- 
pensive fault-tolerant storage in HDFS for what amounts 
to a temporary file. Unfortunately, the computation of 
the reduce function from the previous job and the map 
function of the next job cannot be overlapped: the final 
result of the reduce step cannot be produced until all map 
tasks have completed, which prevents effective pipelining. 
However, in the next sections we describe how online 
aggregation and continuous query pipelines can publish 
“snapshot” outputs that can indeed pipeline between jobs. 


² Our current prototype uses a simple heuristic: if the combiner 
reduces data volume by 1/k on average, we wait until k spill files have 
accumulated before registering them with the TaskTracker. A better 
heuristic would also account for the computational cost of applying the 
combiner function. 


3.3 Fault Tolerance 


Our pipelined Hadoop implementation is robust to the 
failure of both map and reduce tasks. To recover from 
map task failures, we added bookkeeping to the reduce 
task to record which map task produced each pipelined 
spill file. To simplify fault tolerance, the reducer treats 
the output of a pipelined map task as “tentative” until 
the JobTracker informs the reducer that the map task has 
committed successfully. The reducer can merge together 
spill files generated by the same uncommitted mapper, 
but will not combine those spill files with the output of 
other map tasks until it has been notified that the map task 
has committed. Thus, if a map task fails, each reduce task 
can ignore any tentative spill files produced by the failed 
map attempt. The JobTracker will take care of scheduling 
a new map task attempt, as in stock Hadoop. 
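
The reducer-side bookkeeping for tentative output might look like the sketch below. It is a minimal model under stated assumptions: spill files and map tasks are identified by strings, and the three callbacks are hypothetical names for the events described above rather than methods of any Hadoop class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: reducer-side bookkeeping for tentative map output.
final class TentativeOutputSketch {

    // Spill files received from map tasks that have not yet committed,
    // keyed by the id of the map task that produced them.
    private final Map<String, List<String>> tentative = new HashMap<>();

    // Spill files from committed map tasks; safe to merge across tasks.
    final List<String> committed = new ArrayList<>();

    void onSpillReceived(String mapTaskId, String spill) {
        tentative.computeIfAbsent(mapTaskId, id -> new ArrayList<>()).add(spill);
    }

    // The JobTracker reports that the map task committed: promote its spills.
    void onMapCommitted(String mapTaskId) {
        List<String> spills = tentative.remove(mapTaskId);
        if (spills != null) {
            committed.addAll(spills);
        }
    }

    // The JobTracker reports that the map attempt failed: discard its spills.
    void onMapFailed(String mapTaskId) {
        tentative.remove(mapTaskId);
    }
}
```

Keeping tentative spills grouped by producer is what makes discarding a failed attempt's output a single map removal.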

If a reduce task fails and a new copy of the task is 
started, the new reduce instance must be sent all the input 
data that was sent to the failed reduce attempt. If map 
tasks operated in a purely pipelined fashion and discarded 
their output after sending it to a reducer, this would be 
difficult. Therefore, map tasks retain their output data on 
the local disk for the complete job duration. This allows 
the map’s output to be reproduced if any reduce tasks fail. 
For batch jobs, the key advantage of our architecture is 
that reducers are not blocked waiting for the complete 
output of the task to be written to disk. 

Our technique for recovering from map task failure is 
straightforward, but places a minor limit on the reducer’s 
ability to merge spill files. To avoid this, we envision 
introducing a “checkpoint” concept: as a map task runs, it 
will periodically notify the JobTracker that it has reached 
offset x in its input split. The JobTracker will notify any 
connected reducers; map task output that was produced 
before offset x can then be merged by reducers with other 
map task output as normal. To avoid duplicate results, 
if the map task fails, the new map task attempt resumes 
reading its input at offset x. This technique would also 
reduce the amount of redundant work done after a map 
task failure or during speculative execution of “backup” 
tasks [6]. 


3.4 Task Scheduling 


The Hadoop JobTracker had to be retrofitted to support 
pipelining between jobs. In regular Hadoop, jobs are sub- 
mitted one at a time; a job that consumes the output of 
one or more other jobs cannot be submitted until the pro- 
ducer jobs have completed. To address this, we modified 
the Hadoop job submission interface to accept a list of 
jobs, where each job in the list depends on the job before 
it. The client interface traverses this list, annotating each 
job with the identifier of the job that it depends on. The 
JobTracker looks for this annotation and co-schedules 
jobs with their dependencies, giving slot preference to 
“upstream” jobs over the “downstream” jobs they feed. As 
we note in Section 8, there are many interesting options 
for scheduling pipelines or even DAGs of such jobs that 
we plan to investigate in the future. 


4 Online Aggregation 


Although MapReduce was originally designed as a batch- 
oriented system, it is often used for interactive data analy- 
sis: a user submits a job to extract information from a data 
set, and then waits to view the results before proceeding 
with the next step in the data analysis process. This trend 
has accelerated with the development of high-level query 
languages that are executed as MapReduce jobs, such as 
Hive [27], Pig [20], and Sawzall [23]. 

Traditional MapReduce implementations provide a 
poor interface for interactive data analysis, because they 
do not emit any output until the job has been executed 
to completion. In many cases, an interactive user would 
prefer a “quick and dirty” approximation over a correct an- 
swer that takes much longer to compute. In the database 
literature, online aggregation has been proposed to ad- 
dress this problem [12], but the batch-oriented nature 
of traditional MapReduce implementations makes these 
techniques difficult to apply. In this section, we show how 
we extended our pipelined Hadoop implementation to sup- 
port online aggregation within a single job (Section 4.1) 
and between multiple jobs (Section 4.2). In Section 4.3, 
we evaluate online aggregation on two different data sets, 
and show that it can yield an accurate approximate answer 
long before the job has finished executing. 


4.1 Single-Job Online Aggregation 


In HOP, the data records produced by map tasks are sent 
to reduce tasks shortly after each record is generated. 
However, to produce the final output of the job, the reduce 
function cannot be invoked until the entire output of every 
map task has been produced. We can support online 
aggregation by simply applying the reduce function to 
the data that a reduce task has received so far. We call 
the output of such an intermediate reduce operation a 
snapshot. 

Users would like to know how accurate a snapshot 
is: that is, how closely a snapshot resembles the final 
output of the job. Accuracy estimation is a hard problem 
even for simple SQL queries [15], and particularly hard 
for jobs where the map and reduce functions are opaque 
user-defined code. Hence, we report job progress, not 
accuracy: we leave it to the user (or their MapReduce 
code) to correlate progress to a formal notion of accuracy. 
We give a simple progress metric below. 

Snapshots are computed periodically, as new data ar- 
rives at each reducer. The user specifies how often snap- 
shots should be computed, using the progress metric as 
the unit of measure. For example, a user can request that 
a snapshot be computed when 25%, 50%, and 75% of the 
input has been seen. The user may also specify whether to 
include data from tentative (unfinished) map tasks. This 
option does not affect the fault tolerance design described 
in Section 3.3. In the current prototype, each snapshot is 
stored in a directory on HDFS. The name of the directory 
includes the progress value associated with the snapshot. 
Each reduce task runs independently, and at a different 
rate. Once a reduce task has made sufficient progress, it 
writes a snapshot to a temporary directory on HDFS, and 
then atomically renames it to the appropriate location. 

Applications can consume snapshots by polling HDFS 
in a predictable location. An application knows that a 
given snapshot has been completed when every reduce 
task has written a file to the snapshot directory. Atomic 
rename is used to avoid applications mistakenly reading 
incomplete snapshot files. 

Note that if there are not enough free slots to allow all 
the reduce tasks in a job to be scheduled, snapshots will 
not be available for reduce tasks that are still waiting to 
be executed. The user can detect this situation (e.g., by 
checking for the expected number of files in the HDFS 
snapshot directory), so there is no risk of incorrect data, 
but the usefulness of online aggregation will be reduced. 
In the current prototype, we manually configured the 
cluster to avoid this scenario. The system could also 
be enhanced to avoid this pitfall entirely by optionally 
waiting to execute an online aggregation job until there 
are enough reduce slots available. 


4.1.1 Progress Metric 


Hadoop provides support for monitoring the progress of 
task executions. As each map task executes, it is assigned 
a progress score in the range [0,1], based on how much 
of its input the map task has consumed. We reused this 
feature to determine how much progress is represented 
by the current input to a reduce task, and hence to decide 
when a new snapshot should be taken. 

First, we modified the spill file format depicted in Fig- 
ure 2 to include the map’s current progress score. When a 
partition in a spill file is sent to a reducer, the spill file’s 
progress score is also included. To compute the progress 
score for a snapshot, we take the average of the progress 
scores associated with each spill file used to produce the 
snapshot. 

Note that it is possible that a map task might not have 
pipelined any output to a reduce task, either because the 
map task has not been scheduled yet (there are no free 
TaskTracker slots), because the map task does not produce any 
output for the given reduce task, or because the reduce 
task has been configured to only pipeline data from at 
most k map tasks concurrently. To account for this, we 
need to scale the progress metric to reflect the portion of 
the map tasks that a reduce task has pipelined data from: 
if a reducer is connected to 1/n of the total number of map 
tasks in the job, we divide the average progress score by 
n. 
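
As a worked illustration of this metric, the sketch below averages the progress scores carried by the spill files and scales the result by the fraction of map tasks the reducer has received data from. The method name and parameters are hypothetical, and expressing the scaling as connectedMaps / totalMaps (rather than as division by n) is simply an equivalent restatement.

```java
// Illustrative only: the snapshot progress score described above.
final class SnapshotProgressSketch {

    // spillProgress: progress scores in [0, 1] carried by the spill files used
    // to produce the snapshot. connectedMaps / totalMaps is the fraction of the
    // job's map tasks that this reducer has pipelined data from.
    static double snapshotProgress(double[] spillProgress, int connectedMaps, int totalMaps) {
        if (spillProgress.length == 0 || totalMaps <= 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double p : spillProgress) {
            sum += p;
        }
        double average = sum / spillProgress.length;
        // Dividing by n is the same as multiplying by connectedMaps / totalMaps.
        return average * connectedMaps / totalMaps;
    }
}
```

For example, spill files with progress scores 0.5 and 1.0 average to 0.75; if they come from 2 of the job's 4 map tasks, the snapshot's progress score is 0.75 / 2 = 0.375.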

This progress metric could easily be made more sophis- 
ticated: for example, an improved metric might include 
the selectivity (|output|/|input|) of each map task, the 
statistical distribution of the map task’s output, and the 
effectiveness of each map task’s combine function, if any. 
Although we have found our simple progress metric to be 
sufficient for most experiments we describe below, this 
clearly represents an opportunity for future work. 


4.2 Multi-Job Online Aggregation 


Online aggregation is particularly useful when applied 
to a long-running analysis task composed of multiple 
MapReduce jobs. As described in Section 3.2, our version 
of Hadoop allows the output of a reduce task to be sent 
directly to map tasks. This feature can be used to support 
online aggregation for a sequence of jobs. 

Suppose that j1 and j2 are two MapReduce jobs, and j2 
consumes the output of j1. When j1’s reducers compute 
a snapshot to perform online aggregation, that snapshot is 
written to HDFS, and also sent directly to the map tasks of 
j2. The map and reduce steps for j2 are then computed as 
normal, to produce a snapshot of j2’s output. This process 
can then be continued to support online aggregation for 
an arbitrarily long sequence of jobs. 

Unfortunately, inter-job online aggregation has some 
drawbacks. First, the output of a reduce function is not 
‘monotonic’: the output of a reduce function on the first 
50% of the input data may not be obviously related to 
the output of the reduce function on the first 25%. Thus, 
as new snapshots are produced by j1, j2 must be recom- 
puted from scratch using the new snapshot. As with 
inter-job pipelining (Section 3.2), this could be optimized 
for reduce functions that are declared to be distributive or 
algebraic aggregates [9]. 

To support fault tolerance for multi-job online aggrega- 
tion, we consider three cases. Tasks that fail in j1 recover 
as described in Section 3.3. If a task in j2 fails, the system 
simply restarts the failed task. Since subsequent snapshots 
produced by j1 are taken from a superset of the mapper 
output in j1, the next snapshot received by the restarted 
reduce task in j2 will have a higher progress score. To 
handle failures in j1, tasks in j2 cache the most recent 
snapshot received from j1, and replace it when they receive 
a new snapshot with a higher progress metric. If tasks 
from both jobs fail, a new task in j2 recovers the most 
recent snapshot from j1 that was stored in HDFS and then 
waits for snapshots with a higher progress score. 


Figure 4: Top-100 query over 5.5GB of Wikipedia article 
text. The vertical lines describe the increasing accuracy of 
the approximate answers produced by online aggregation. 


4.3 Evaluation 


To evaluate the effectiveness of online aggregation, we 
performed two experiments on Amazon EC2 using differ- 
ent data sets and query workloads. In our first experiment, 
we wrote a “Top-K” query using two MapReduce jobs: 
the first job counts the frequency of each word and the 
second job selects the K most frequent words. We ran 
this workload on 5.5GB of Wikipedia article text stored 
in HDFS, using a 128MB block size. We used a 60-node 
EC2 cluster; each node was a “high-CPU medium” EC2 
instance with 1.7GB of RAM and 2 virtual cores. A vir- 
tual core is the equivalent of a 2007-era 2.5GHz Intel Xeon 
processor. A single EC2 node executed the Hadoop Job- 
Tracker and the HDFS NameNode, while the remaining 
nodes served as slaves for running the TaskTrackers and 
HDFS DataNodes. 

Figure 4 shows the results of inter-job online aggrega- 
tion for a Top-100 query. Our accuracy metric for this 
experiment is post-hoc: we note the time at which the 
Top-K words in the snapshot are the Top-K words in the 
final result. Although the final result for this job did not 
appear until nearly the end, we did observe the Top-5, 10, 
and 20 values at the times indicated in the graph. The 
Wikipedia data set was biased toward these Top-K words 
(e.g., “the”, “is”, etc.), which remained in their correct 
position throughout the lifetime of the job. 


4.3.1 Approximation Metrics 


In our second experiment, we considered the effectiveness 
of the job progress metric described in Section 4.1.1. Un- 
surprisingly, this metric can be inaccurate when it is used 
to estimate the accuracy of the approximate answers pro- 
duced by online aggregation. In this experiment, we com- 
pared the job progress metric with a simple user-defined 
metric that leverages knowledge of the query and data 
set. HOP allows such metrics, although developing such 
a custom metric imposes more burden on the programmer 
than using the generic progress-based metric. 


Figure 5: Comparison of two approximation metrics. Figure (a) shows the relative error for each approximation metric 
over the runtime of the job, averaged over all groups. Figure (b) compares an example approximate answer produced by 
each metric with the final answer, for each language and for a single hour. 


We used a data set containing seven months of hourly 
page view statistics for more than 2.5 million Wikipedia 
articles [26]. This constituted 320GB of compressed data 
(1TB uncompressed), divided into 5066 compressed files. 
We stored the data set on HDFS and assigned a single 
map task to each file, which was decompressed before the 
map function was applied. 


We wrote a MapReduce job to count the total number of 
page views for each language and each hour of the day. In 
other words, our query grouped by language and hour of 
day, and summed the number of page views that occurred 
in each group. To enable more accurate approximate 
answers, we modified the map function to include the 
fraction of a given hour that each record represents. The 
reduce function summed these fractions for a given hour, 
which equated to one for all records from a single map 
task. Since the total number of hours was known ahead 
of time, we could use the result of this sum over all map 
outputs to determine the total fraction of each hour that 
had been sampled. We call this user-defined metric the 
“sample fraction.” 


To compute approximate answers, each intermediate re- 
sult was scaled up using two different metrics: the generic 
metric based on job progress and the sample fraction de- 
scribed above. Figure 5a reports the relative error of the 
two metrics, averaged over all groups. Figure 5b shows 
an example approximate answer for a single hour using 
both metrics (computed two minutes into the job runtime). 
This figure also contains the final answer for comparison. 
Both results indicate that the sample fraction metric pro- 
vides a much more accurate approximate answer for this 
query than the progress-based metric. 

Job progress is clearly the wrong metric to use for ap- 
proximating the final answer of this query. The primary 
reason is that it is too coarse a metric. Each interme- 
diate result was computed from some fraction of each 
hour. However, the job progress assumes that this fraction 
is uniform across all hours, when in fact we could have 
received much more of one hour and much less of another. 
This assumption of uniformity in the job progress resulted 
in a significant approximation error. By contrast, the sam- 
ple fraction scales the approximate answer for each group 
according to the actual fraction of data seen for that group, 
yielding much more accurate approximations. 
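
The difference between the two metrics comes down to which "fraction seen" is used to scale a partial sum. The sketch below assumes the scaling is a simple division of the partial sum by that fraction; the text does not spell out the exact formula, so this is an assumption, and the class and method names are hypothetical.

```java
// Illustrative only: scaling a partial sum up to an estimate of the final sum.
final class ScaleUpSketch {

    // partialSum: the sum observed so far for one group.
    // fractionSeen: estimated fraction of that group's data already processed,
    // taken either from overall job progress or from a per-group sample fraction.
    static double estimate(double partialSum, double fractionSeen) {
        if (fractionSeen <= 0.0) {
            throw new IllegalArgumentException("no data seen yet");
        }
        return partialSum / fractionSeen;
    }
}
```

With purely hypothetical numbers: if a group's partial sum is 800 after 80% of that group's data has arrived while overall job progress is only 50%, scaling by job progress gives estimate(800, 0.5) = 1600, whereas scaling by the per-group fraction gives estimate(800, 0.8), approximately 1000.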


5 Continuous Queries 


MapReduce is often used to analyze streams of constantly- 
arriving data, such as URL access logs [6] and system 
console logs [30]. Because of traditional constraints on 
MapReduce, this is done in large batches that can only 
provide periodic views of activity. This introduces sig- 
nificant latency into a data analysis process that ideally 
should run in near-real time. It is also potentially inef- 
ficient: each new MapReduce job does not have access 
to the computational state of the last analysis run, so this 
state must be recomputed from scratch. The programmer 
can manually save the state of each job and then reload it 
for the next analysis operation, but this is labor-intensive. 

Our pipelined version of Hadoop allows an alternative 
architecture: MapReduce jobs that run continuously, ac- 
cepting new data as it becomes available and analyzing it 
immediately. This allows for near-real-time analysis of 
data streams, and thus allows the MapReduce program- 
ming model to be applied to domains such as environment 
monitoring and real-time fraud detection. 

In this section, we describe how HOP supports contin- 
uous MapReduce jobs, and how we used this feature to 
implement a rudimentary cluster monitoring tool. 


5.1 Continuous MapReduce Jobs 


A bare-bones version of continuous MapReduce jobs is
easy to implement using pipelining. No changes
are needed to implement continuous map tasks: map 
output is already delivered to the appropriate reduce task 
shortly after it is generated. We added an optional “flush” 
API that allows map functions to force their current output 
to reduce tasks. When a reduce task is unable to accept 
such data, the mapper framework stores it locally and 
sends it at a later time. With proper scheduling of reducers, 
this API allows a map task to ensure that an output record 
is promptly sent to the appropriate reducer. 

To support continuous reduce tasks, the user-defined 
reduce function must be periodically invoked on the map 
output available at that reducer. Applications will have 
different requirements for how frequently the reduce func- 
tion should be invoked; possible choices include periods 
based on wall-clock time, logical time (e.g., the value of a 
field in the map task output), and the number of input rows 
delivered to the reducer. The output of the reduce func- 
tion can be written to HDFS, as in our implementation of 
online aggregation. However, other choices are possible; 
our prototype system monitoring application (described 
below) sends an alert via email if an anomalous situation 
is detected. 
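
As an illustration of periodic reduce invocation, the following is a minimal sketch of a trigger policy based on row count or wall-clock time. The class and function names are hypothetical and do not correspond to the HOP API; a logical-time policy could be added as a third condition on a record field.

```python
import time


class ReduceTrigger:
    """Decides when a continuous reducer should invoke the reduce function."""

    def __init__(self, max_rows=None, max_seconds=None, clock=time.monotonic):
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.clock = clock
        self.rows = 0
        self.last_fire = clock()

    def record(self, n=1):
        self.rows += n

    def should_fire(self):
        if self.max_rows is not None and self.rows >= self.max_rows:
            return True
        if (self.max_seconds is not None
                and self.clock() - self.last_fire >= self.max_seconds):
            return True
        return False

    def reset(self):
        self.rows = 0
        self.last_fire = self.clock()


def run_continuous_reduce(records, reduce_fn, trigger):
    """Buffer incoming map output and call reduce_fn whenever the trigger fires."""
    buffered, outputs = [], []
    for rec in records:
        buffered.append(rec)
        trigger.record()
        if trigger.should_fire():
            outputs.append(reduce_fn(buffered))
            buffered = []
            trigger.reset()
    return outputs


# Invoke the reduce function (here: sum) after every 4 input rows.
outputs = run_continuous_reduce(range(10), sum, ReduceTrigger(max_rows=4))
print(outputs)
```

Rows that arrive after the last firing stay buffered until the trigger fires again, which matches the idea of a job that never terminates.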

In our current implementation, the number of map and 
reduce tasks is fixed, and must be configured by the user.
This is clearly problematic: manual configuration is error-
prone, and many stream processing applications exhibit
“bursty” traffic patterns, in which peak load far exceeds 
average load. In the future, we plan to add support for 
elastic scaleup/scaledown of map and reduce tasks in 
response to variations in load. 


5.1.1 Fault Tolerance 


In the checkpoint/restart fault-tolerance model used by 
Hadoop, mappers retain their output until the end of the 
job to facilitate fast recovery from reducer failures. In a 
continuous query context, this is infeasible, since map- 
per history is in principle unbounded. However, many 
continuous reduce functions (e.g., 30-second moving av- 
erage) only require a suffix of the map output stream. This 
common case can be supported easily, by extending the 
JobTracker interface to capture a rolling notion of reducer 
consumption. Map-side spill files are maintained in a ring 
buffer with unique IDs for spill files over time. When a 
reducer commits an output to HDFS, it informs the Job-
Tracker about the run of map output records it no longer
needs, identifying the run by spill file IDs and offsets 
within those files. The JobTracker can then tell mappers 
to garbage collect the appropriate data. 
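
The bookkeeping described above might be sketched as follows. This is a simplified, single-process model under several assumptions: the class and method names are hypothetical, the JobTracker's mediating role is folded into one object, and the buffer is unbounded rather than a fixed-size ring.

```python
from collections import OrderedDict


class MapSpillBuffer:
    """Map-side spill files keyed by monotonically increasing IDs."""

    def __init__(self, reducer_ids):
        self.spills = OrderedDict()   # spill_id -> list of records
        self.next_id = 0
        # Per reducer: the first (spill_id, offset) position it still needs.
        self.needed_from = {r: (0, 0) for r in reducer_ids}

    def add_spill(self, records):
        spill_id = self.next_id
        self.spills[spill_id] = list(records)
        self.next_id += 1
        return spill_id

    def release(self, reducer_id, spill_id, offset):
        """Record that a reducer no longer needs data before (spill_id, offset)."""
        current = self.needed_from[reducer_id]
        self.needed_from[reducer_id] = max(current, (spill_id, offset))

    def garbage_collect(self):
        """Delete spill files that no reducer needs any more."""
        low_spill, _ = min(self.needed_from.values())
        dead = [sid for sid in self.spills if sid < low_spill]
        for sid in dead:
            del self.spills[sid]
        return dead


buf = MapSpillBuffer(["r0", "r1"])
for chunk in (["a", "b"], ["c", "d"], ["e", "f"]):
    buf.add_spill(chunk)

buf.release("r0", 2, 0)            # r0 has committed everything before spill 2
first_gc = buf.garbage_collect()   # r1 still needs spill 0, so nothing is freed
buf.release("r1", 1, 1)            # r1 now only needs spill 1 from offset 1 on
second_gc = buf.garbage_collect()
```

A spill file is reclaimed only once the slowest reducer has moved past it, which is why the first collection frees nothing.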

In principle, complex reducers may depend on very 
long (or infinite) histories of map records to accurately 
reconstruct their internal state. In that case, deleting spill 
files from the map-side ring buffer will result in poten- 
tially inaccurate recovery after faults. Such scenarios 
can be handled by having reducers checkpoint internal 
state to HDFS, along with markers for the mapper off- 
sets at which the internal state was checkpointed. The 
MapReduce framework can be extended with APIs to help 
with state serialization and offset management, but it still 
presents a programming burden on the user to correctly 
identify the sensitive internal state. That burden can be 
avoided by more heavyweight process-pair techniques 
for fault tolerance, but those are quite complex and use 
significant resources [24]. In our work to date we have 
focused on cases where reducers can be recovered from a 
reasonable-sized history at the mappers, favoring minor 
extensions to the simple fault-tolerance approach used in 
Hadoop. 
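
The checkpointing alternative can be sketched for a trivial reducer that keeps a running sum. Everything here is an illustrative assumption: the class name, the JSON encoding, and the in-memory `mapper_logs` dictionary that stands in for map-side history retained from the checkpointed offsets onward.

```python
import json


class SummingReducer:
    def __init__(self):
        self.total = 0
        self.offsets = {}   # mapper_id -> number of records consumed so far

    def consume(self, mapper_id, value):
        self.total += value
        self.offsets[mapper_id] = self.offsets.get(mapper_id, 0) + 1

    def checkpoint(self):
        """Serialize internal state together with the mapper offsets."""
        return json.dumps({"total": self.total, "offsets": self.offsets})

    @classmethod
    def restore(cls, blob, mapper_logs):
        """Rebuild state from a checkpoint, then replay records past each offset."""
        snap = json.loads(blob)
        reducer = cls()
        reducer.total = snap["total"]
        reducer.offsets = dict(snap["offsets"])
        for mapper_id, log in mapper_logs.items():
            start = reducer.offsets.get(mapper_id, 0)
            for value in log[start:]:
                reducer.consume(mapper_id, value)
        return reducer


mapper_logs = {"m0": [1, 2, 3, 4], "m1": [10, 20]}

r = SummingReducer()
r.consume("m0", 1)
r.consume("m0", 2)
r.consume("m1", 10)
blob = r.checkpoint()   # would be written to HDFS in the scheme described above

# After a simulated failure, recover and replay only the unseen suffix.
recovered = SummingReducer.restore(blob, mapper_logs)
```

The programmer still has to decide what belongs in the checkpoint, which is the burden the text refers to.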


5.2 Prototype Monitoring System 


Our monitoring system is composed of agents that run on 
each monitored machine and record statistics of interest 
(e.g., load average, I/O operations per second, etc.). Each 
agent is implemented as a continuous map task: rather 
than reading from HDFS, the map task instead reads from 
various system-local data streams (e.g., /proc). 

Each agent forwards statistics to an aggregator that is 
implemented as a continuous reduce task. The aggregator 
records how agent-local statistics evolve over time (e.g., 
by computing windowed-averages), and compares statis- 
tics between agents to detect anomalous behavior. Each 
aggregator monitors the agents that report to it, but might 
also report statistical summaries to another “upstream” 
aggregator. For example, the system might be configured 
to have an aggregator for each rack and then a second 
level of aggregators that compare statistics between racks 
to analyze datacenter-wide behavior. 


5.3 Evaluation


To validate our prototype system monitoring tool, we con- 
structed a scenario in which one member of a MapReduce 
cluster begins thrashing during the execution of a job. Our 
goal was to test how quickly our monitoring system would 
detect this behavior. The basic mechanism is similar to an 
alert system one of the authors implemented at an Internet 
search company.

Figure 6: Number of pages swapped over time on the
thrashing host, as reported by vmstat. The vertical 
line indicates the time at which the alert was sent by the 
monitoring system. 


We used a simple load metric (a linear combination of 
CPU utilization, paging, and swap activity). The continu- 
ous reduce function maintains windows over samples of 
this metric: at regular intervals, it compares the 20 second 
moving average of the load metric for each host to the 
120 second moving average of all the hosts in the cluster 
except that host. If the given host’s load metric is more 
than two standard deviations above the global average, it 
is considered an outlier and a tentative alert is issued. To 
dampen false positives in “bursty” load scenarios, we do 
not issue an alert until we have received 10 tentative alerts 
within a time window. 
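
The detection rule can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the class name is hypothetical, the standard deviation is assumed to be taken over the other hosts' samples in the 120-second window, and the 60-second alert window is an assumed value because the text does not state one.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class OutlierDetector:
    def __init__(self, short_s=20, long_s=120, k=2.0,
                 alerts_needed=10, alert_window_s=60):
        self.short_s = short_s
        self.long_s = long_s
        self.k = k
        self.alerts_needed = alerts_needed
        self.alert_window_s = alert_window_s   # assumed; not given in the text
        self.samples = defaultdict(deque)      # host -> (time, load) pairs
        self.tentative = defaultdict(deque)    # host -> tentative alert times

    def add_sample(self, host, t, load):
        q = self.samples[host]
        q.append((t, load))
        while q[0][0] <= t - self.long_s:      # drop samples older than long_s
            q.popleft()

    def _recent(self, host, now, span):
        return [v for (t, v) in self.samples[host] if t > now - span]

    def check(self, host, now):
        """Return True when a non-tentative alert should fire for host."""
        own = self._recent(host, now, self.short_s)
        others = [v for h in list(self.samples) if h != host
                  for v in self._recent(h, now, self.long_s)]
        if not own or len(others) < 2:
            return False
        threshold = mean(others) + self.k * pstdev(others)
        q = self.tentative[host]
        if mean(own) > threshold:
            q.append(now)                      # tentative alert
        while q and q[0] <= now - self.alert_window_s:
            q.popleft()
        return len(q) >= self.alerts_needed


# Synthetic trace: host "c" starts thrashing at t = 20 seconds.
det = OutlierDetector()
first_alert = None
for t in range(40):
    for host in ("a", "b", "c"):
        load = 10.0 if (host == "c" and t >= 20) else 1.0
        det.add_sample(host, t, load)
    if det.check("c", t) and first_alert is None:
        first_alert = t
```

Requiring several tentative alerts delays detection by a few evaluation intervals but keeps a single load spike from raising an alert.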


We deployed this system on an EC2 cluster consisting 
of 7 “large” nodes (large nodes were chosen because EC2 
allocates an entire physical host machine to them). We 
ran a wordcount job on the 5.5GB Wikipedia data set, 
using 5 map tasks and 2 reduce tasks (1 task per host). 
After the job had been running for about 10 seconds, we 
selected a node running a task and launched a program 
that induced thrashing. 


We report detection latency in Figure 6. The vertical 
bar indicates the time at which the monitoring tool fired a 
(non-tentative) alert. The thrashing host was detected very 
rapidly—notably faster than the 5-second TaskTracker- 
JobTracker heartbeat cycle that is used to detect straggler 
tasks in stock Hadoop. We envision using these alerts 
to do early detection of stragglers within a MapReduce 
job: HOP could make scheduling decisions for a job by 
running a secondary continuous monitoring query. Com- 
pared to out-of-band monitoring tools, this economy of 
mechanism—reusing the MapReduce infrastructure for 
reflective monitoring—has benefits in software mainte- 
nance and system management. 


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation


6 Performance Evaluation 


A thorough performance comparison between pipelining 
and blocking is beyond the scope of this paper. In this 
section, we instead demonstrate that pipelining can reduce 
job completion times in some configurations. 

We report performance using both large (512MB) and 
small (32MB) HDFS block sizes using a single workload
(a wordcount job over randomly-generated text). Since 
the words were generated using a uniform distribution, 
map-side combiners were ineffective for this workload. 
We performed all experiments using relatively small clus- 
ters of Amazon EC2 nodes. We also did not consider 
performance in an environment where multiple concur- 
rent jobs are executing simultaneously. 


6.1 Background and Configuration 


Before diving into the performance experiments, it is im- 
portant to further describe the division of labor in a HOP 
job, which is broken into task phases. A map task consists 
of two work phases: map and sort. The majority of work 
is performed in the map phase, where the map function 
is applied to each record in the input and subsequently 
sent to an output buffer. Once the entire input has been 
processed, the map task enters the sort phase, where a 
final merge sort of all intermediate spill files is performed 
before registering the final output with the TaskTracker. 
The progress reported by a map task corresponds to the 
map phase only. 

A reduce task in HOP is divided into three work phases: 
shuffle, reduce, and commit. In the shuffle phase, reduce 
tasks receive their portion of the output from each map. 
In HOP, the shuffle phase consumes 75% of the overall
reduce task progress while the remaining 25% is allocated
to the reduce and commit phases.³ In the shuffle phase,
reduce tasks periodically perform a merge sort on the 
already received map output. These intermediate merge 
sorts decrease the amount of sorting work performed at 
the end of the shuffle phase. After receiving its portion of 
data from all map tasks, the reduce task performs a final 
merge sort and enters the reduce phase. 
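
The progress weighting can be written as a small function. This is a sketch under two assumptions not stated in the text: shuffle progress is measured as the fraction of map outputs received, and the reduce and commit phases are treated as one combined fraction.

```python
SHUFFLE_WEIGHT = 0.75  # share of reduce-task progress assigned to the shuffle phase


def reduce_task_progress(maps_fetched, total_maps, post_shuffle_fraction):
    """Overall reduce-task progress in [0, 1].

    post_shuffle_fraction is the completed share of the combined reduce
    and commit work; it only counts once the shuffle has finished.
    """
    shuffle = maps_fetched / total_maps
    if shuffle < 1.0:
        return SHUFFLE_WEIGHT * shuffle
    return SHUFFLE_WEIGHT + (1 - SHUFFLE_WEIGHT) * post_shuffle_fraction


halfway_through_shuffle = reduce_task_progress(10, 20, 0.0)
shuffle_done = reduce_task_progress(20, 20, 0.0)
finished = reduce_task_progress(20, 20, 1.0)
```

Under this weighting, a task that has received all map output but is still performing its final merge reports exactly 75% progress.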

By pushing work from map tasks to reduce tasks more 
aggressively, pipelining can enable better overlapping of 
map and reduce computation, especially when the node 
on which a reduce task is scheduled would otherwise be 
underutilized. However, when reduce tasks are already the 
bottleneck, pipelining offers fewer performance benefits, 
and may even hurt performance by placing additional load 
on the reduce nodes. 


³ The stock version of Hadoop divides the reduce progress evenly
among the three phases. We deviated from this approach because we
wanted to focus more on the progress during the shuffle phase.

Figure 7: CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 5 reduce
tasks (512MB block size). The total job runtimes were 561 seconds for blocking and 462 seconds for pipelining.

Figure 8: CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 20 reduce
tasks (512MB block size). The total job runtimes were 361 seconds for blocking and 290 seconds for pipelining.


The sort phase in the map task minimizes the merging 
work that reduce tasks must perform at the end of the 
shuffle phase. When pipelining is enabled, the sort phase 
is avoided since map tasks have already sent some fraction 
of the spill files to concurrently running reduce tasks. 
Therefore, pipelining increases the merging workload 
placed on the reducer. The adaptive pipelining scheme 
described in Section 3.1.3 attempts to ensure that reduce 
tasks are not overwhelmed with additional load. 

We used two Amazon EC2 clusters depending on the 
size of the experiment: “small” jobs used 10 worker nodes, 
while “large” jobs used 20. Each node was an “extra large”
EC2 instance with 15GB of memory and four virtual
cores. 


6.2 Small Job Results 


Our first experiment focused on the performance of small 
jobs in an underutilized cluster. We ran a 10GB word- 
count with a 512MB block size, yielding 20 map tasks. 
We used 10 worker nodes and configured each worker 
to execute at most two map and two reduce tasks simul-
taneously. We ran several experiments to compare the
performance of blocking and pipelining using different
numbers of reduce tasks. 


Figure 7 reports the results with five reduce tasks. A 
plateau can be seen at 75% progress for both blocking 
and pipelining. At this point in the job, all reduce tasks 
have completed the shuffle phase; the plateau is caused by 
the time taken to perform a final merge of all map output 
before entering the reduce phase. Notice that the plateau 
for the pipelining case is shorter. With pipelining, reduce 
tasks receive map outputs earlier and can begin sorting 
earlier, thereby reducing the time required for the final 
merge. 


Figure 8 reports the results with twenty reduce tasks. 
Using more reduce tasks decreases the amount of merging 
that any one reduce task must perform, which reduces the 
duration of the plateau at 75% progress. In the blocking 
case, the plateau is practically gone. 


Note that in both experiments, the map phase finishes 
faster with blocking than with pipelining. This is because 
pipelining allows reduce tasks to begin executing more 
quickly; hence, the reduce tasks compete for resources 
with the map tasks, causing the map phase to take slightly
longer. In this case, the increase in map duration is out-
weighed by the increase in cluster utilization, resulting in
shorter job completion times: pipelining reduced comple-
tion time by 17.7% with 5 reducers and by 19.7% with 20
reducers.

Figure 9: CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 1 reduce
task (512MB block size). The total job runtimes were 29 minutes for blocking and 34 minutes for pipelining.

Figure 10: CDF of map and reduce task completion times for a 100GB wordcount job using 240 map tasks and 60
reduce tasks (512MB block size). The total job runtimes were 48 minutes for blocking and 36 minutes for pipelining.

Figure 9 describes an experiment in which we ran a 
10GB wordcount job using a single reduce task. This 
caused job completion times to increase dramatically for 
both pipelining and blocking, because of the extreme 
load placed on the reduce node. Pipelining delayed job 
completion by ~17%, which suggests that our simple 
adaptive flow control scheme (Section 3.1.3) was unable 
to move load back to the map tasks aggressively enough. 
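
The percentages for the small jobs can be rechecked from the total runtimes given in the captions of Figures 7, 8, and 9. The helper below is only a worked calculation; its name is hypothetical.

```python
def change_pct(blocking, pipelining):
    """Relative change in completion time; positive means pipelining was faster."""
    return 100.0 * (blocking - pipelining) / blocking


# Total job runtimes in seconds, taken from the figure captions.
runs = {
    "5 reducers": (561, 462),
    "20 reducers": (361, 290),
    "1 reducer": (29 * 60, 34 * 60),
}

results = {label: change_pct(b, p) for label, (b, p) in runs.items()}
for label, pct in results.items():
    print(f"{label}: {pct:+.1f}%")
```

The computed values agree with the reported figures to within a fraction of a percentage point; any small difference is likely due to the caption runtimes being rounded.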


6.3 Large Job Results 


Our second set of experiments focused on the perfor- 
mance of somewhat larger jobs. We increased the input 
size to 100GB (from 10GB) and the number of worker
nodes to 20 (from 10). Each worker was configured to 
execute at most four map and three reduce tasks, which 
meant that at most 80 map and 60 reduce tasks could
execute at once. We conducted two sets of experimental
runs, each run comparing blocking to pipelining using 
either large (512MB) or small (32MB) block sizes. We 
were interested in blocking performance with small block 
sizes because blocking can effectively emulate pipelining 
if the block size is small enough. 


Figure 10 reports the performance of a 100GB word- 
count job with 512MB blocks, which resulted in 240 map 
tasks, scheduled in three waves of 80 tasks each. The 
60 reduce tasks were coscheduled with the first wave of 
map tasks. In the blocking case, the reduce tasks began 
working as soon as they received the output of the first 
wave, which is why the reduce progress begins to climb 
around four minutes (well before the completion of all 
maps). Pipelining was able to achieve significantly better 
cluster utilization, and hence reduced job completion time 
by ~25%. 
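
The slot and wave counts for the large cluster follow from simple arithmetic, sketched here with hypothetical helper names. The task counts are the ones reported in the text and figure captions.

```python
import math


def total_slots(workers, slots_per_worker):
    return workers * slots_per_worker


def waves(num_tasks, slots):
    """Number of scheduling waves needed to run num_tasks on the given slots."""
    return math.ceil(num_tasks / slots)


map_slots = total_slots(20, 4)       # 20 workers, four map slots each
reduce_slots = total_slots(20, 3)    # 20 workers, three reduce slots each

map_waves_512mb = waves(240, map_slots)     # 240 map tasks with 512MB blocks
map_waves_32mb = waves(3120, map_slots)     # 3120 map tasks with 32MB blocks
reduce_waves = waves(60, reduce_slots)
```

With 32MB blocks the same cluster runs many more, much shorter map waves, which is why blocking begins to approximate pipelining at small block sizes.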


Figure 11 reports the performance of blocking and 
pipelining using 32MB blocks. While the performance of 
pipelining remained similar, the performance of blocking 
improved considerably, but still trailed somewhat behind 
pipelining. Using block sizes smaller than 32MB did
not yield a significant performance improvement in our
experiments.

Figure 11: CDF of map and reduce task completion times for a 100GB wordcount job using 3120 map tasks and 60
reduce tasks (32MB block size). The total job runtimes were 42 minutes for blocking and 34 minutes for pipelining.


7 Related Work 


The work in this paper relates to literature on parallel 
dataflow frameworks, online aggregation, and continuous 
query processing. 


7.1 Parallel Dataflow 


Dean and Ghemawat’s paper on Google’s MapReduce [6] 
has become a standard reference, and forms the basis of 
the open-source Hadoop implementation. As noted in Sec- 
tion 1, the Google MapReduce design targets very large 
clusters where the probability of worker failure or slow- 
down is high. This led to their elegant checkpoint/restart 
approach to fault tolerance, and their lack of pipelining. 
Our work extends the Google design to accommodate 
pipelining without significant modification to their core 
programming model or fault tolerance mechanisms. 

Dryad [13] is a data-parallel programming model and 
runtime that is often compared to MapReduce, supporting 
a more general model of acyclic dataflow graphs. Like 
MapReduce, Dryad puts disk materialization steps be- 
tween dataflow stages by default, breaking pipelines. The 
Dryad paper describes support for optionally “encapsulat-
ing” multiple asynchronous stages into a single process
so they can pipeline, but this requires a more complicated 
programming interface. The Dryad paper explicitly men- 
tions that the system is targeted at batch processing, and 
not at scenarios like continuous queries. 

It has been noted that parallel database systems have 
long provided partitioned dataflow frameworks [21], 
and recent commercial databases have begun to offer 
MapReduce programming models on top of those frame- 
works [5, 10]. Most parallel database systems can pro- 


vide pipelined execution akin to our work here, but they 
use a more tightly coupled iterator and Exchange model 
that keeps producers and consumers rate-matched via 
queues, spreading the work of each dataflow stage across 
all nodes in the cluster [8]. This provides less schedul- 
ing flexibility than MapReduce and typically offers no 
tolerance to mid-query worker faults. Yang et al. re- 
cently proposed a scheme to add support for mid-query 
fault tolerance to traditional parallel databases, using a 
middleware-based approach that shares some similarities 
with MapReduce [31]. 

Logothetis and Yocum describe a MapReduce inter-
face over a continuous query system called Mortar that
is similar in some ways to our work [16]. Like HOP, 
their mappers push data to reducers in a pipelined fashion. 
They focus on specific issues in efficient stream query pro- 
cessing, including minimization of work for aggregates 
in overlapping windows via special reducer APIs. They 
are not built on Hadoop, and explicitly sidestep issues in 
fault tolerance. 

Hadoop Streaming is part of the Hadoop distribution,
and allows map and reduce functions to be expressed 
as UNIX shell command lines. It does not stream data 
through map and reduce phases in a pipelined fashion. 


7.2 Online Aggregation 


Online aggregation was originally proposed in the con- 
text of simple single-table SQL queries involving “Group 
By” aggregations, a workload quite similar to MapRe- 
duce [12]. The focus of the initial work was on providing 
not only “early returns” to these SQL queries, but also sta- 
tistically robust estimators and confidence interval metrics 
for the final result based on random sampling. These sta- 
tistical matters do not generalize to arbitrary MapReduce 
jobs, though our framework can support those that have 
been developed. Subsequently, online aggregation was ex-
tended to handle join queries (via the Ripple Join method),
and the CONTROL project generalized the idea of online
query processing to provide interactivity for data cleaning, 
data mining, and data visualization tasks [11]. That work 
was targeted at single-processor systems. Luo et al. devel- 
oped a partitioned-parallel variant of Ripple Join, without 
statistical guarantees on approximate answers [17]. 

In recent years, this topic has seen renewed interest, 
starting with Jermaine et al.’s work on the DBO sys- 
tem [15]. That effort includes more disk-conscious online 
join algorithms, as well as techniques for maintaining 
randomly-shuffled files to remove any potential for sta- 
tistical bias in scans [14]. Wu et al. describe a system 
for peer-to-peer online aggregation in a distributed hash 
table context [29]. The open programmability and fault- 
tolerance of MapReduce are not addressed significantly 
in prior work on online aggregation. 

An alternative to online aggregation combines precom- 
putation with sampling, storing fixed samples and sum- 
maries to provide small storage footprints and interactive 
performance [7]. An advantage of these techniques is that 
they are compatible with both pipelining and blocking 
models of MapReduce. The downside of these techniques 
is that they do not allow users to choose the query stop- 
ping points or time/accuracy trade-offs dynamically [11]. 


7.3 Continuous Queries 


In the last decade there was a great deal of work in the 
database research community on the topic of continuous 
queries over data streams, including systems such as Bo- 
realis [1], STREAM [18], and Telegraph [4]. Of these, 
Borealis and Telegraph [24] studied fault tolerance and 
load balancing across machines. In the Borealis context 
this was done for pipelined dataflows, but without parti- 
tioned parallelism: each stage (“operator”) of the pipeline 
runs serially on a different machine in the wide area, and 
fault tolerance deals with failures of entire operators [3]. 
SBON [22] is an overlay network that can be integrated 
with Borealis, which handles “operator placement” opti- 
mizations for these wide-area pipelined dataflows. 
Telegraph’s FLuX operator [24, 25] is the only work to 
our knowledge that addresses mid-stream fault-tolerance 
for dataflows that are both pipelined and partitioned in 
the style of HOP. FLuX (“Fault-tolerant, Load-balanced
eXchange”) is a dataflow operator that encapsulates the
shuffling done between stages such as map and reduce. It 
provides load-balancing interfaces that can migrate oper- 
ator state (e.g., reducer state) between nodes, while han- 
dling scheduling policy and changes to data-routing poli- 
cies [25]. For fault tolerance, FLuX develops a solution 
based on process pairs [24], which work redundantly to 
ensure that operator state is always being maintained live 
on multiple nodes. This removes any burden on the con-
tinuous query programmer of the sort we describe in Sec-
tion 5. On the other hand, the FLuX protocol is far more
complex and resource-intensive than our pipelined adap-
tation of Google’s checkpoint/restart fault-tolerance model.


8 Conclusion and Future Work


MapReduce has proven to be a popular model for large- 
scale parallel programming. Our Hadoop Online Pro- 
totype extends the applicability of the model to pipelin- 
ing behaviors, while preserving the simple programming 
model and fault tolerance of a full-featured MapReduce 
framework. This provides significant new functionality, 
including “early returns” on long-running jobs via online 
aggregation, and continuous queries over streaming data. 
We also demonstrate benefits for batch processing: by 
pipelining both within and across jobs, HOP can reduce 
the time to job completion. 

In considering future work, scheduling is a topic that 
arises immediately. Stock Hadoop already has many de- 
grees of freedom in scheduling batch tasks across ma- 
chines and time, and the introduction of pipelining in 
HOP only increases this design space. First, pipeline par- 
allelism is a new option for improving performance of 
MapReduce jobs, but needs to be integrated intelligently 
with both intra-task partition parallelism and speculative 
redundant execution for “straggler” handling. Second, the
ability to schedule deep pipelines with direct communica- 
tion between reduces and maps (bypassing the distributed 
file system) opens up new opportunities and challenges in 
carefully co-locating tasks from different jobs, to avoid 
communication when possible. 

Olston and colleagues have noted that MapReduce 
systems—unlike traditional databases—employ “model- 
light” optimization approaches that gather and react to 
performance information during runtime [19]. The con- 
tinuous query facilities of HOP enable powerful intro- 
spective programming interfaces for this: a full-featured 
MapReduce interface can be used to script performance 
monitoring tasks that gather system-wide information in 
near-real-time, enabling tight feedback loops for schedul- 
ing and dataflow optimization. This is a topic we plan to 
explore, including opportunistic methods to do monitor- 
ing work with minimal interference to outstanding jobs, 
as well as dynamic approaches to continuous optimization 
in the spirit of earlier work like Eddies [2] and FLuX [25]. 

As a more long-term agenda, we want to explore using
MapReduce-style programming for even more interac- 
tive applications. As a first step, we hope to revisit in- 
teractive data processing in the spirit of the CONTROL 
work [11], with an eye toward improved scalability via 
parallelism. More aggressively, we are considering the 
idea of bridging the gap between MapReduce dataflow 
programming and lightweight event-flow programming 
models like SEDA [28]. Our HOP implementation’s roots
in Hadoop make it unlikely to compete with something
like SEDA in terms of raw performance. However, it 
would be interesting to translate ideas across these two 
traditionally separate programming models, perhaps with 
an eye toward building a new and more general-purpose 
framework for programming in architectures like cloud 
computing and many-core.


Abstract


Many Web services operate their own Web crawlers 
to discover data of interest, despite the fact that large- 
scale, timely crawling is complex, operationally inten- 
sive, and expensive. In this paper, we introduce the ex- 
tensible crawler, a service that crawls the Web on be- 
half of its many client applications. Clients inject filters 
into the extensible crawler; the crawler evaluates all re- 
ceived filters against each Web page, notifying clients of 
matches. As a result, the act of crawling the Web is de- 
coupled from determining whether a page is of interest, 
shielding client applications from the burden of crawling 
the Web themselves. 

This paper describes the architecture, implementa- 
tion, and evaluation of our prototype extensible crawler, 
and also relates early experience from several crawler 
applications we have built. We focus on the challenges 
and trade-offs in the system, such as the design of a filter 
language that is simultaneously expressive and efficient 
to execute, the use of filter indexing to cheaply match a 
page against millions of filters, and the use of document 
and filter partitioning to scale our prototype implemen- 
tation to high document throughput and large numbers 
of filters. We argue that the low-latency, high selectiv- 
ity, and scalable nature of our system makes it a promis- 
ing platform for taking advantage of emerging real-time 
streams of data, such as Facebook or Twitter feeds. 


1 Introduction 


Over the past decade, an astronomical amount of in- 
formation has been published on the Web. As well, 
Web services such as Twitter, Facebook, and Digg re- 
flect a growing trend to provide people and applica- 
tions with access to real-time streams of information 
updates. Together, these two characteristics imply that 
the Web has become an exceptionally potent reposi- 
tory of programmatically accessible data. Some of the 
most provocative recent Web applications are those that 
gather and process large-scale Web data, such as virtual 
tourism [33], knowledge extraction [15], Web site trust
assessment [24], and emerging trend detection [6].


New Web services that want to take advantage of 
Web-scale data face a high barrier to entry. Finding and 
accessing data of interest requires crawling the Web, and 
if a service is sensitive to quick access to newly pub- 
lished data, its Web crawl must operate continuously and 
focus on the most relevant subset of the Web. Unfortu- 
nately, massive-scale, timely web crawling is complex, 
operationally intensive, and expensive. Worse, for ser- 
vices that are only interested in specific subsets of Web 
data, crawling is wasteful, as most pages retrieved will 
not match their criteria of interest. 


In this paper, we introduce the extensible crawler, a 
utility service that crawls the Web on behalf of its many 
client applications. An extensible crawler lets clients 
specify filters that are evaluated over each crawled Web 
page; if a page matches one of the filters specified by 
a client, the client is notified of the match. As a re- 
sult, the act of crawling the Web is decoupled from 
the application-specific logic of determining if a page is 
of interest, shielding Web-crawler applications from the 
burden of crawling the Web themselves. 


We anticipate two deployment modes for an extensi- 
ble crawler. First, it can run as a service accessible re- 
motely across the wide-area Internet. In this scenario, 
filter sets must be very highly selective, since the band- 
width between the extensible crawler and a client appli- 
cation is scarce and expensive. Second, it can run as a 
utility service [17] within cloud computing infrastructure 
such as Amazon’s EC2 or Google’s AppEngine. Filters 
can be much less selective in this scenario, since band- 
width between the extensible crawler and its clients is 
abundant, and the clients can pay to scale up the compu- 
tation processing selected documents. 


This paper describes our experience with the design, 
implementation, and evaluation of an extensible crawler, 
focusing on the challenges and trade-offs inherent in this 
class of system. For example, an extensible crawler’s fil- 
ter language must be sufficiently expressive to support 
interesting applications, but simultaneously, filters must 
be efficient to execute. A naive implementation of an ex-
tensible crawler would require computational resources
proportional to the number of filters it supports multi-
plied by its crawl rate; instead, our extensible crawler
prototype uses standard indexing techniques to vastly re- 
duce the cost of executing a large number of filters. To 
scale, an extensible crawler must be distributed across 
a cluster. Accordingly, the system must balance load 
(both filters and pages) appropriately across machines, 
otherwise an overloaded machine will limit the rate at 
which the entire system can process crawled pages. Fi- 
nally, there must be appropriate mechanisms in place to 
allow web-crawler applications to update their filter sets 
frequently and efficiently. 
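
To illustrate why indexing avoids evaluating every filter against every page, the following is a minimal sketch of one standard technique, an inverted index over filter keywords. It is not XCrawler's filter language or implementation: filters are assumed to be simple conjunctions of words, and all names are hypothetical.

```python
from collections import defaultdict


class FilterIndex:
    def __init__(self):
        self.filters = {}                  # filter_id -> frozenset of required words
        self.by_word = defaultdict(set)    # word -> filter_ids indexed under it

    def add(self, filter_id, words):
        words = frozenset(w.lower() for w in words)
        if not words:
            raise ValueError("a filter needs at least one keyword")
        self.filters[filter_id] = words
        # Index the filter under a single keyword: a page that lacks this
        # word cannot match, so the filter is never evaluated for it.
        self.by_word[min(words)].add(filter_id)

    def match(self, page_text):
        tokens = set(page_text.lower().split())
        candidates = set()
        for tok in tokens:
            candidates |= self.by_word.get(tok, set())
        # Only the candidate filters are checked in full.
        return sorted(f for f in candidates if self.filters[f] <= tokens)


idx = FilterIndex()
idx.add("vanity", ["jonathan", "hsieh"])
idx.add("product", ["palm", "pre"])

hits = idx.match("a review of the new palm pre")
misses = idx.match("jonathan smith writes about crawlers")
```

The work per page then depends on the number of candidate filters that share a word with the page, not on the total number of registered filters. Indexing under the alphabetically smallest word is an arbitrary choice made for determinism; choosing the rarest word would typically produce fewer candidates.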

We demonstrate that XCrawler, our early prototype 
system, is scalable across several dimensions: it can ef- 
ficiently process tens of millions of concurrent filters 
while processing thousands of Web pages per second. 
XCrawler is also flexible. By construction, we show 
that its filter specification language facilitates a wide 
range of interesting web-crawler applications, including 
keyword-based notification, Web malware detection and 
defense, and copyright violation detection. 

An extensible crawler bears similarities to sev- 
eral other systems, including streaming and parallel 
databases [1, 10, 11, 13, 14, 19], publish-subscribe 
systems [2, 9, 16, 27, 31], search engines and web 
crawlers [8, 12, 20, 21], and packet filters [23, 25, 30, 
32, 34]. Our design borrows techniques from each, but 
we argue that the substantial differences in the work- 
load, scale, and application requirements of extensible 
crawlers mandate many different design choices and op- 
timizations. We compare XCrawler to related systems in 
the related work section (Section 5), and we provide an 
in-depth comparison to search engines in Section 2.1. 


2 Overview 


To better motivate the goals and requirements of 
extensible crawlers, we now describe a set of web- 
crawler applications that we have experimented with using
our prototype system. Table 1 gives some order-of-magnitude
estimates of the workload that we expect
these applications would place on an extensible crawler
if deployed at scale, including the total number of filters
each application category would create and the selectivity
of a client’s filter set.

Keyword-based notification. Similar to Google 
Alerts, this application allows users to register keyword 
phrases of interest, and receive an event stream corre- 
sponding to Web pages containing those keywords. For 
example, users might upload a vanity filter (“Jonathan
Hsieh”), or a filter to track a product or company (“palm
pre”). This application must support a large number of
users, each with a small and relatively slowly-changing 
filter set. Each filter should be highly selective, matching 
a very small fraction of Web pages.

                          keyword        Web       copyright   Web
                          notification   malware   violation   research
# clients                 ~10^6          ~10^2     ~10^2       ~10^3
# filters per client      ~10^2          ~10^6     ~10^6       ~10^2
fraction of pages that    ~10^-5         ~10^-4    ~10^-6      ~10^-3
  match for a client
total # filters           ~10^8          ~10^8     ~10^8       ~10^5











Table 1: Web-crawler application workloads. This ta- 
ble summarizes the approximate filter workloads we ex- 
pect from four representative applications. 


Web malware detection. This application uses a 
database of regular-expression-based signatures to iden- 
tify malicious executables, JavaScript, or Web con- 
tent. New malware signatures are injected daily, and 
clients require prompt notification when new malicious 
pages are discovered. This application must support a 
small number of clients (e.g., McAfee, Google, and 
Symantec), each with a large and moderately quickly 
changing filter set. Each filter should be highly selec-
tive; in aggregate, roughly 1 in 1000 Web
pages contain malicious content [26, 28].

Copyright violation detection. Similar to commercial 
offerings such as attributor.com, this application 
lets clients find Web pages containing
their intellectual property. A client, such as a news
provider, maintains a large database of highly selective 
filters, such as key sentences from their news articles. 
New filters are injected into the system by a client when- 
ever new content is published. This application must sup- 
port a moderate number of clients, each with a large, se- 
lective, and potentially quickly changing filter set. 

Web measurement research. This application permits 
scientists to perform large-scale measurements of the 
Web to characterize its content and dynamics. Individual 
research projects would inject filters to randomly sample 
Web pages (e.g., sample 1 in 1000 random pages as rep- 
resentative of the overall Web) or to select Web pages 
with particular features and tags relevant to the study 
(e.g., select Ajax-related JavaScript keywords in a study 
investigating the prevalence of Ajax on the Web). This 
application would support a modest number of clients,
each with a moderately sized, slowly changing filter set.


2.1 Comparison to a search engine 


At first glance, one might consider implementing an 
extensible crawler as a layer on top of a conventional 
search engine. This strawman would periodically exe- 
cute filters against the search engine, looking for new 
document matches and transmitting those to applica- 
tions. On closer inspection, however, several fundamen- 
tal differences between search engines and extensible 
crawlers, their workloads, and their performance require-
ments are evident, as summarized in Table 2. Because of
these differences, we argue that there is an opportunity
to design an extensible crawler that will scale more effi-
ciently and better suit the needs of its applications than a
search-engine-based implementation.

Table 2: Search engines vs. extensible crawlers. This
table summarizes key distinctions between the workload,
performance, and scalability requirements of search en-
gines and extensible crawlers.

In many regards, an extensible crawler is an inversion 
of a search engine. A search engine crawls the Web to pe- 
riodically update its stored index of Web documents, and 
receives a stream of Web queries that it processes against 
the document index on-the-fly. In contrast, an extensi- 
ble crawler periodically updates its stored index of filters, 
and receives a stream of Web documents that it processes 
against the filter index on-the-fly. For a search engine, 
though it is important to reduce the time in between doc- 
ument index updates, it is crucial to minimize query re- 
sponse time. For an extensible crawler, it is important to 
be responsive in receiving filter updates from clients, but 
for “real-time Web” applications, it is more important to 
process crawled documents with low latency. 

There are also differences in scale between these two 
systems. A search engine must store and index hundreds 
of billions, if not trillions, of Web documents, contain- 
ing kilobytes or megabytes of data. On the other hand, 
an extensible crawler must store and index hundreds of 
millions, or billions, of filters; our expectation is that fil- 
ters are small, perhaps dozens or hundreds of bytes. As a 
result, an extensible crawler must store and index four or 
five orders of magnitude less data than a search engine, 
and it is more likely to be able to afford to keep its entire 
index resident in memory. 

Finally, there are important differences in the perfor- 
mance and result accuracy requirements of the two sys- 
tems. A given search engine query might match millions 
of Web pages. To be usable, the search engine must rely 
heavily on page ranking to present the top matches to
users. Filters for an extensible crawler are assumed to
be more selective than search engine queries, but even 
if they are not, filters are executed against documents as 
they are crawled rather than against the enormous Web 
corpus gathered by a search engine. All matching pages 
found by an extensible crawler are communicated to a 
web-crawler application; result ranking is not relevant.

Traditional search engines and extensible crawlers are 
in some ways complementary, and they can co-exist. Our 
work focuses on quickly matching freshly crawled docu-
ments against a set of filters; however, many applications
can benefit from being able to issue queries against a full, 
existing Web index in addition to filtering newly discov- 
ered content. 


2.2 Architectural goals 


Our extensible crawler architecture has been guided 
by several principles and system goals: 


High Selectivity. The primary role of an extensi- 
ble crawler is to reduce the number of web pages a 
web-crawler application must process by a substantial 
amount, while preserving pages in which the applica- 
tion might have interest. An extensible crawler can be 
thought of as a highly selective, programmable matching 
filter executing as a pipeline stage between a stock Web 
crawler and a web-crawler application. 


Indexability. When possible, an extensible crawler 
should trade off CPU for memory to reduce the compu- 
tational cost of supporting a large number of filters. In 
practice, this implies constructing an index over filters 
to support the efficient matching of a document against 
all filters. One implication of this is that the index must 
be kept up-to-date as the set of filters defined by web- 
crawler applications is updated. If this update rate is low 
or the indexing technique used supports incremental up- 
dates, keeping the index up-to-date should be efficient. 


Favor Efficiency over Precision. There is generally 
a tradeoff between the precision of a filter and its effi- 
cient execution, and in these cases, an extensible crawler 
should favor efficient execution. For example, a filter 
language that supports regular expressions can be more 
precise than a filter language that supports only conjuncts 
of substrings, but it is simpler to build an efficient index 
over the latter. As we will discuss in Section 3.2.2, our 
XCrawler prototype implementation exposes a rich fil- 
ter language to web-crawler applications, but uses relax- 
ation to convert precise filters into less-precise, indexable 
versions, increasing its scalability at the cost of exposing 
false positive matches to the applications. 


Low Latency. To support crawler-applications that de- 
pend on real-time Web content, an extensible crawler 
should be capable of processing Web pages with low la-
tency. This goal suggests the extensible crawler should
be architected as a stage in a dataflow pipeline, rather
than as a batch or map-reduce style computation.

Figure 1: Extensible crawler architecture. This fig-
ure depicts the high-level architecture of an extensible
crawler, including the flow of documents from the Web
through the system.


Scalability. An extensible crawler should scale up to 
support high Web page processing rates and a very large 
number of filters. One of our specific goals is to han- 
dle a linear increase in document processing rate with a 
corresponding linear increase in machine resources. 


2.3 System architecture 


Figure 1 shows the high-level architecture of an ex- 
tensible crawler. A conventional Web crawler is used to 
fetch a high rate stream of documents from the Web. De- 
pending on the needs of the extensible crawler’s appli- 
cations, this crawl can be broad, focused, or both. For 
example, to provide applications with real-time informa- 
tion, the crawler might focus on real-time sources such 
as Twitter, Facebook, and popular news sites. 

Web documents retrieved by the crawler are parti- 
tioned across pods for processing. A pod is a set of 
nodes that, in aggregate, contains all filters known to the 
system. Because documents are partitioned across pods, 
each document needs to be processed by only a single pod;
by increasing the number of pods within the system, the 
overall throughput of the system increases. Document 
set partitioning therefore facilitates the scaling up of the 
system’s document processing rate. 

Within each pod, the set of filters known to the exten- 
sible crawler is partitioned across the pod’s nodes. Filter 
set partitioning is a form of sharding and it is used to 
address the memory or CPU limitations of an individual 
node. As more filters are added to the extensible crawler, 
additional nodes may need to be added to each pod, and 
the partitioning of filters across nodes might need adjust-
ment. Because filters are partitioned across pod nodes,
each document arriving at a pod needs to be distributed
to each pod node for processing. Thus, the throughput of 
the pod is limited by the slowest node within the pod; this 
implies that load balancing of filters across pod nodes is 
crucially important to the overall system throughput. 

Each node within the pod contains a subset of the sys- 
tem’s filters. A naive approach to processing a docu- 
ment on a node would involve looping over each filter 
on that node serially. Though this approach would work 
correctly, it would scale poorly as the number of filters 
grows. Instead, as we will discuss in Section 3.2, we 
trade memory for computation by using filter indexing, 
relaxation, and staging techniques; this allows us to eval-
uate a document against a node’s full filter set in time that
grows much more slowly than linearly with the number of filters.

If a document matches any filters on a node, the node 
notifies a match collector process running within the pod. 
The collector gathers all filters that match a given docu- 
ment and distributes match notifications to the appropri- 
ate web-crawler application clients. 
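
The dataflow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the prototype's code: the names `Node`, `Pod`, and `route`, the round-robin routing rule, and the keyword-only filters are all hypothetical, and the per-filter loop inside `Node.match` stands in for the indexed matching discussed in Section 3.2.

```python
from collections import defaultdict

class Node:
    """Holds one shard of the filter set; each filter is (filter_id, client, keyword)."""
    def __init__(self, filters):
        self.filters = list(filters)

    def match(self, doc_text):
        # Naive per-filter loop, standing in for indexed matching (assumption).
        return [(fid, client) for fid, client, kw in self.filters if kw in doc_text]

class Pod:
    """A set of nodes that, in aggregate, holds every filter known to the system."""
    def __init__(self, nodes):
        self.nodes = nodes

    def process(self, doc_id, doc_text):
        # Filters are sharded across nodes, so every node must see the document.
        notifications = defaultdict(list)
        for node in self.nodes:
            for fid, client in node.match(doc_text):
                notifications[client].append((doc_id, fid))
        # Plays the role of the match collector: one batch of matches per client.
        return dict(notifications)

def route(doc_id, pods):
    # Each document goes to exactly one pod; round-robin by id is an assumption.
    return pods[doc_id % len(pods)]

filters = [(1, "alice", "obama"), (2, "bob", "palm pre"), (3, "alice", "crawler")]
# Two pods, each replicating the full filter set across two nodes.
pods = [Pod([Node(filters[:2]), Node(filters[2:])]) for _ in range(2)]
result = route(7, pods).process(7, "a review of the palm pre and a web crawler")
```

Adding a pod raises document throughput, while adding a node to every pod raises filter capacity; the sketch keeps those two axes separate in the same way the text does.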

Applications interact with the extensible crawler 
through two interfaces. They upload, delete, or modify 
filters in their filter sets with the filter management API. 
As well, they receive a stream of notification events cor- 
responding to documents that match at least one of their 
filters through the notification API. We have considered 
but not yet experimented with other interfaces, such as one
for letting applications influence the pages that the web 
crawler visits. 


3 XCrawler Design and Implementation 


In this section, we describe the design and imple- 
mentation of XCrawler, our prototype extensible crawler. 
XCrawler is implemented in Java and runs on a cluster of 
commodity multi-core x86 machines, connected by a gi- 
gabit switched network. Our primary optimization con- 
cern while building XCrawler was efficiently scaling to 
a large number of expressive filters. 

In the rest of this section, we drill down into four as- 
pects of XCrawler’s design and implementation: the fil- 
ter language it exposes to clients, how a node matches 
an incoming document against its filter set, how docu- 
ments and filters are partitioned across pods and nodes, 
and how clients are notified about matches. 


3.1 Filter language and document model 


XCrawler’s declarative filter language strikes a bal- 
ance between expressiveness for the user and execution 
efficiency for the system. The filter language has four 
entities: attributes, operators, values, and expressions. 
There are two kinds of values: simple and composite. 
Simple values can be of several types, including byte se- 
quences, strings, integers and boolean values. Composite 
values are tuples of values.

A document is a tuple of attribute-value pairs.
Attributes are named fields within a document; during
crawling, each Web document is pre-processed to extract
a static set of attributes and values. This set is passed
to nodes and is referenced by filters during execution.
Examples of a document’s attribute-value pairs include
its URL, the raw HTTP content retrieved by the crawler,
certain HTTP headers like Content-Length or Content-
Type, and if appropriate, structured text extracted from
the raw content. To support sampling, we also provide
a random number attribute whose per-document value is
fixed when other attributes are extracted.

A user-provided filter is a predicate expression; if the 
expression evaluates to true against a document, then the 
filter matches the document. A predicate expression is either
a boolean operator over a single document attribute,
or a conjunct of predicate expressions. A boolean opera- 
tor expression is an (attribute, operator, value) triple, and 
is represented in the form: 


attribute.operator(value)


The filter language provides expensive operators such 
as substring and regular expression matching as well as 
simple operators like equalities and inequalities. 

For example, a user could specify a search for the 
phrase “Barack Obama” in HTML files by specifying: 


mimetype.equals("text/html") &
text.substring("Barack Obama")


Alternatively, the user could widen the set of accept- 
able documents by specifying a conjunction of multiple, 
less restrictive keyword substring filters:


mimetype.equals("text/html") &
text.substring("Barack") &
text.substring("Obama")


Though simple, this language is rich enough to sup- 
port the applications outlined previously in Section 2. 
For example, our prototype Web malware detection ap- 
plication is implemented as a set of regular expression 
filters derived from the ClamAV virus and malware sig- 
nature database. 
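
To make the semantics of the filter language concrete, the following sketch evaluates a conjunction of (attribute, operator, value) triples against a document represented as attribute-value pairs. It is an illustration only, written in Python for brevity: the operator names `regex` and `lessthan`, the attribute name `random`, and the list-of-triples encoding are assumptions, since the text names only `equals` and `substring` explicitly.

```python
import re

# Operator names "equals" and "substring" follow the examples in the text;
# "regex" and "lessthan" are assumed names for illustration.
OPERATORS = {
    "equals": lambda actual, value: actual == value,
    "substring": lambda actual, value: value in actual,
    "regex": lambda actual, value: re.search(value, actual) is not None,
    "lessthan": lambda actual, value: actual < value,
}

def matches(filter_conjuncts, document):
    """A filter is a list of (attribute, operator, value) triples; all must hold."""
    for attribute, operator, value in filter_conjuncts:
        if attribute not in document:
            return False
        if not OPERATORS[operator](document[attribute], value):
            return False
    return True

# A document is a set of attribute-value pairs extracted at crawl time.
document = {
    "url": "http://example.com/news",
    "mimetype": "text/html",
    "text": "President Barack H. Obama spoke today.",
    "random": 0.4217,  # per-document random value, usable by sampling filters
}

phrase_filter = [("mimetype", "equals", "text/html"),
                 ("text", "substring", "Barack Obama")]
keyword_filter = [("mimetype", "equals", "text/html"),
                  ("text", "substring", "Barack"),
                  ("text", "substring", "Obama")]
# Samples roughly 1 in 1000 documents via the random attribute.
sample_filter = [("random", "lessthan", 0.001)]
```

On this document the phrase filter does not match (the middle initial breaks the exact phrase), while the wider keyword conjunction does, which is the difference between the two example filters above.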


3.2 Filter execution 


When a newly crawled document is dispatched to a 
node, that node must match the document against its set 
of filters. As previously mentioned, a naive approach 
to executing filters would be to iterate over them se- 
quentially; unfortunately, the computational resources 
required for this approach would scale linearly with both 
the number of filters and the document crawl rate, which 
is severely limiting. Instead, we must find a way to opti- 
mize the execution of a set of filters. 

To do this, we rely on three techniques. To maintain
throughput while scaling up the number of filters
on a node, we create memory-resident indexes for the
attributes referenced by filters. Matching a document 
against an indexed filter set requires a small number of 
index lookups, rather than computation proportional to 
the number of filters. However, a high fidelity index 
might require too much memory, and constructing an ef- 
ficient index over an attribute that supports a complex op- 
erator such as a regular expression might be intractable. 
In either case, we use relaxation to convert a filter into 
a form that is simpler or cheaper to index. For example, 
we can relax a regular expression filter into one that uses 
a conjunction of substring operators. 

A relaxed filter is less precise than the full filter from 
which it was derived, potentially causing false positives. 
If the false positive rate is too high, we can feed the tenta- 
tive matches from the index lookups into a second stage 
that executes filters precisely but at higher cost. By stag- 
ing the execution of some filters, we regain higher preci- 
sion while still controlling overall execution cost. How- 
ever, if the false positive rate resulting from a relaxed 
filter is acceptably low, staging is not necessary, and all 
matches (including false positives) are sent to the client. 
Whether a false positive rate is acceptable depends on 
many factors, including the execution cost of staging in 
the extensible crawler, the bandwidth overhead of trans- 
mitting false positives to the client, and the cost to the 
client of handling false positives. 


3.2.1 Indexing 


Indexed filter execution requires the construction of 
an index for each attribute that a filter set references, and 
for each style of operator that is used on those attributes. 
For example, if a filter set uses a substring operator over 
the document body attribute, we build an Aho-Corasick 
multistring search trie [3] over the values specified by fil- 
ters referencing that attribute. As another example, if a 
filter set uses numeric inequality operators over the docu- 
ment size attribute, we construct a binary search tree over 
the values specified by filters referencing that attribute. 
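
As an illustration of the substring index, the sketch below is a textbook Aho-Corasick automaton that reports every pattern occurring in a text in a single pass. It is not the prototype's code: the class name `AhoCorasick` and its list-based node layout are assumptions, and a production index would map pattern ids back to filters and clients.

```python
from collections import deque

class AhoCorasick:
    def __init__(self, patterns):
        # Per trie node: outgoing edges, failure link, ids of patterns ending here.
        self.edges = [{}]
        self.fail = [0]
        self.out = [[]]
        for pattern_id, pattern in enumerate(patterns):
            self._insert(pattern, pattern_id)
        self._build_failure_links()

    def _insert(self, pattern, pattern_id):
        state = 0
        for ch in pattern:
            if ch not in self.edges[state]:
                self.edges.append({})
                self.fail.append(0)
                self.out.append([])
                self.edges[state][ch] = len(self.edges) - 1
            state = self.edges[state][ch]
        self.out[state].append(pattern_id)

    def _build_failure_links(self):
        # Breadth-first order guarantees shallower states are finished first.
        queue = deque(self.edges[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, child in self.edges[state].items():
                fallback = self.fail[state]
                while fallback and ch not in self.edges[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.edges[fallback].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return the set of pattern ids that occur anywhere in text."""
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.edges[state]:
                state = self.fail[state]
            state = self.edges[state].get(ch, 0)
            found.update(self.out[state])
        return found

patterns = ["he", "she", "his", "hers"]
index = AhoCorasick(patterns)
hits = index.search("ushers")
```

The cost of `search` depends on the length of the document and the number of matches rather than on the number of patterns, which is the property the indexing argument relies on.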

Executing a document against a filter set requires 
looking up the document’s attributes against all indexes 
to find potentially matching filters. For filters that con- 
tain a conjunction of predicate expressions, we could in- 
sert each expression into its appropriate index. Instead, 
we identify and index only the most selective predicate 
expression; if the filter survives this initial index lookup, 
we can either notify the client immediately and risk false 
positives or use staging (discussed in Section 3.2.3) to 
evaluate potential matches more precisely. 

Creating indexes lets us execute a large number of fil- 
ters efficiently. Figure 2 compares the number of nodes 
that would be required in our XCrawler prototype to sus- 
tain a crawl rate of 100,000 documents per second, using 
either naive filter execution or filter execution with indexing,
relaxation, and staging enabled. The filter set
used in this measurement consists of sentences extracted from
Wikipedia articles; this emulates the workload of a copy-
right violation detection application. Our measurements
were gathered on a small number of nodes, and pro-
jected upwards to larger numbers of nodes assuming lin-
ear scaleup in crawl rate with document set partitioning.

Figure 2: Indexed filter execution. This graph com-
pares the number of nodes (machines) required for a
crawl rate of 100,000 documents per second when us-
ing naive filter execution and when using indexed filter
execution, including relaxation and staging.

Our prototype runs on 8 core, 2 GHz Intel processors. 
When using indexing, relaxation, and staging, a node 
with 3GB of RAM is capable of storing approximately 
400,000 filters of this workload, and can process docu- 
ments at a rate of approximately 9,000 documents per 
second. To scale to 100,000 documents per second, we 
would need 12 pods, i.e., we must replicate the full filter
set 12 times, and partition incoming documents across 
these replicas. To scale to 9,200,000 filters, we would 
need to partition the filter set across 24 machines with 
3GB of RAM each. Thus, the final system configuration 
would have 12 pods, each with 24 nodes, for a total of 
288 machines. If we installed more RAM on each ma- 
chine, we would need commensurately fewer machines. 
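
The sizing arithmetic in this paragraph can be checked with a few lines. The variable names are ours, and the sketch assumes that a pod's throughput equals that of a single node, since every node in a pod processes every document routed to that pod. A straight division of the filter count gives 23 nodes per pod; the 24 used in the text leaves a little headroom, which is plausible given that the per-node capacity is stated as approximate.

```python
import math

target_docs_per_sec = 100_000
node_docs_per_sec = 9_000      # per-node throughput reported in the text
target_filters = 9_200_000
filters_per_node = 400_000     # approximate capacity of a node with 3GB of RAM

# Each pod holds the full filter set, so pods scale document throughput.
pods = math.ceil(target_docs_per_sec / node_docs_per_sec)

# Nodes within a pod scale filter capacity.
min_nodes_per_pod = math.ceil(target_filters / filters_per_node)
nodes_per_pod = 24             # figure used in the text

total_machines = pods * nodes_per_pod
```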

Even when including the additional cost of staging, 
indexed execution can provide several orders of magni- 
tude better scaling characteristics than naive execution 
as the number of filters grows. Note that the CPU is the 
bottleneck resource for execution in both cases, although 
with staged indexing, staging causes the CPUs to be pri- 
marily occupied with processing false positives from re- 
laxed filters. 


3.2.2 Relaxation 


We potentially encounter two problems when using 
indexing: the memory footprint of indexes might be ex- 
cessive, and it might be infeasible to index attributes or 
operators such as regular expressions or conjuncts. To 
cope with either problem, we use relaxation to convert a 
filter into a form that is less accurate but indexable. 

As one example, consider copyright violation detec-
tion filters that contain sentences that should be searched
for as substrings within documents. Instead of search-
ing for the full sentence, filters can be relaxed to search
for an ngram extracted from the sentence (e.g., a 16 byte
character fragment). This would significantly reduce the
size of the in-memory index.

Figure 3: Indexed and relaxed filter memory foot-
print. This graph compares the memory footprint of sub-
string filters when using four different execution strate-
gies. The average filter length in the filter set was 130
bytes, and relaxation used 32 byte ngrams. Note that the
relaxed+indexed and the relaxed+indexed+staged lines
overlap on the graph.


There are many possible ngram relaxations for a spe- 
cific string; the ideal relaxation would be just as selec- 
tive as the full sentence, returning no false positives. 
Intuitively, shorter ngrams will tend to be less selec- 
tive but more memory efficient. Less intuitively, dif- 
ferent fragments extracted from the same string might 
have different selectivity. Consider the string <a 
href="http://zyzzyva.com">, and two possi- 
ble 8-byte relaxations <a href= and /zyzzyva: the 
former would be much less selective than the latter. 
Given this, our prototype gathers run-time statistics on 
the hit rate of relaxed substring operations, identifies re- 
laxations that have anomalously high hit rates, and se- 
lects alternative relaxations for them. If we cannot find a 
low hit rate relaxation, we ultimately reject the filter. 
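
The selection of a relaxation can be sketched as follows. This is a simplification under stated assumptions: a fixed sample of documents stands in for the run-time hit-rate statistics the prototype gathers, and the function names and the `max_hit_rate` threshold of 5% are hypothetical.

```python
def ngrams(s, n):
    """All length-n substrings of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def hit_rate(gram, sample_docs):
    """Fraction of sample documents that contain the ngram."""
    return sum(gram in doc for doc in sample_docs) / len(sample_docs)

def pick_relaxation(full_string, sample_docs, n=8, max_hit_rate=0.05):
    """Return the least frequently hit ngram, or None if even that one is too
    common, in which case the filter would be rejected."""
    candidates = ngrams(full_string, n)
    if not candidates:
        return None
    best = min(candidates, key=lambda g: hit_rate(g, sample_docs))
    if hit_rate(best, sample_docs) > max_hit_rate:
        return None
    return best

sample_docs = [
    '<a href="http://example.com">home</a>',
    '<p>no links here</p>',
    '<a href="http://news.example.org">news</a>',
]
anchor = '<a href="http://zyzzyva.com">'
chosen = pick_relaxation(anchor, sample_docs)
```

On this sample, `<a href=` hits two of the three documents while any ngram that reaches into `zyzzyva` hits none, so the chosen relaxation is one of the latter.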


Relaxation also allows us to index operations that are 
not directly or efficiently indexable. Conjuncts are not 
directly indexable, but can be relaxed by picking a selective
indexable subexpression. A match of this subexpression
is not as precise as the full conjunction, but can eliminate
a large portion of true negatives. Similarly, regular
expressions could hypothetically be indexed by combin- 
ing their automata, but combined automata tend to have 
exponentially large state requirements or high computa- 
tional requirements [12, 23, 32]. Instead, if we can iden- 
tify substrings that the regular expression implies must 
occur in an accepted document, we can relax the regular 
expression into a less selective but indexable substring. If 
a suitably selective substring cannot be identified from a 
given regular expression, that filter can be rejected when
the client application attempts to upload it.

Figure 4: Filter relaxation trade-off. This graph illus-
trates the trade-off between memory footprint and false
positive rates when using different degrees of relaxation.

Figure 3 compares the memory footprint of naive, 
indexed, relaxed+indexed, and relaxed+indexed+staged 
filter execution. The filter set used in this measurement is 
the same as in Figure 2, namely sentences extracted from 
Wikipedia articles, averaging 130 characters in length. 
Relaxation consists of selecting a random 32 character 
substring from a sentence. The figure demonstrates that 
indexing imposes a large memory overhead relative to 
naive execution, but that relaxation can substantially re- 
duce this overhead. 

Relaxation potentially introduces false positives. Fig- 
ure 4 illustrates the trade-off between the memory foot- 
print of filter execution and the hit rate, as the degree of 
relaxation used varies. With no relaxation, an indexed 
filter set of 400,000 Wikipedia sentences averaging 130 
characters in length requires 8.7GB of memory and has 
a hit rate of 0.25% of Web documents. When relaxing 
these filters to 64 byte ngrams, the memory footprint is 
reduced to 3.5GB and the hit rate marginally climbs to 
0.26% of documents. More aggressive relaxation causes 
a substantial increase in false positives. With 32 byte 
ngrams, the memory footprint is just 1.4GB, but the hit 
rate grows to 1.44% of documents: more than four out of five
hits are false positives.
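
The false positive share follows directly from the two hit rates. The short calculation below assumes that the unrelaxed hit rate of 0.25% represents the true matches and that every true match is still reported after relaxation, which holds because each ngram is a substring of its sentence. The variable names are ours.

```python
true_hit_rate = 0.25      # % of documents hit by the unrelaxed filters
relaxed_hit_rate = 1.44   # % of documents hit with 32 byte ngrams

# Hits beyond the true matches are false positives.
false_positive_share = (relaxed_hit_rate - true_hit_rate) / relaxed_hit_rate

# Memory saved by going from the unrelaxed index (8.7GB) to 32 byte ngrams (1.4GB).
memory_saving = 1 - 1.4 / 8.7
```

Both quantities come out at a little over 80%, so at this degree of relaxation the fraction of memory saved and the fraction of hits that are spurious are of similar size.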


3.2.3 Staging 


If a relaxed filter causes too many false positives, we 
can use staging to eliminate them at the cost of additional 
computation. More specifically, if a filter is marked 
for staging, any document that matches the relaxed ver- 
sion of the filter (a partial hit) is subsequently executed 
against the full version of that filter. Thus, the first stage 
of filter execution consists of index lookups, while the 
second stage of execution iterates through the partial hits 
identified by the first stage. 

The second stage of execution does not benefit from 
indexing or relaxation. Accordingly, if the partial hit rate 
in the first stage is too high, the second stage of execution
has the potential to dominate computation time and limit
the throughput of the system. As well, any filter that is 
staged requires the full version of the filter to be stored in 
memory. Staging eliminates false positives, but has both 
a computational and memory cost. 
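
Two-stage execution can be sketched as below. The function name `staged_match` and its data layout are assumptions, and a scan over a dictionary of ngrams stands in for the index lookup of the first stage. The sketch also counts naive evaluations, since that count is what drives the computational cost of staging.

```python
def staged_match(doc_text, relaxed_index, full_filters, staged):
    """relaxed_index: ngram -> ids of filters sharing that relaxation.
    full_filters: filter id -> full substring.
    staged: ids of filters marked for second-stage evaluation."""
    matches, naive_evaluations = [], 0
    # Stage one: index lookup (a dict scan stands in for the real index).
    for gram, filter_ids in relaxed_index.items():
        if gram not in doc_text:
            continue
        for fid in filter_ids:            # each of these is a partial hit
            if fid not in staged:
                matches.append(fid)       # reported as-is; may be a false positive
                continue
            # Stage two: execute the full filter naively on the partial hit.
            naive_evaluations += 1
            if full_filters[fid] in doc_text:
                matches.append(fid)
    return sorted(matches), naive_evaluations

relaxed_index = {"quick brown": [1, 2], "lazy dog": [3]}
full_filters = {1: "the quick brown fox",
                2: "a quick brown bear",
                3: "over the lazy dog"}
doc = "the quick brown fox jumps over a lazy dog"

partly_staged = staged_match(doc, relaxed_index, full_filters, staged={1, 2})
fully_staged = staged_match(doc, relaxed_index, full_filters, staged={1, 2, 3})
```

With only filters 1 and 2 staged, filter 3 is reported even though its full sentence is absent; staging all three removes that false positive at the price of one more naive evaluation.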


3.3 Partitioning


As with most cluster-based services, the extensible 
crawler achieves cost-efficient scaling by partitioning 
its work across inexpensive commodity machines. Our 
workload consists of two components: documents that 
continuously arrive from the crawler and filters that are 
periodically uploaded or updated by client applications. 
To scale, the extensible crawler must find an intelligent 
partitioning of both documents and filters across ma- 
chines. 


3.3.1 Document set partitioning 


Our first strategy, which we call document set parti-
tioning, is used to increase the overall throughput of the
extensible crawler. As previously described, we define 
a pod as a set of nodes that, in aggregate, contains all 
filters known to the system. Thus, each pod contains all 
information necessary to process a document against a 
filter set. To increase the throughput of the system, we 
can add a pod, essentially replicating the configuration 
of existing pods onto a new set of machines. 

Incoming documents are partitioned across pods, and 
consequently, each document must be routed to a single 
pod. Since each document is processed independently 
of others, no interaction between pods is necessary in the 
common case. Document set partitioning thus leads to an 
embarrassingly parallel workload, and linear scalability. 
Our implementation monitors the load of each pod, pe- 
riodically adjusting the fraction of incoming documents 
directed to each pod to alleviate hot spots. 
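
One way to realize this kind of load-sensitive routing is weighted random assignment with periodic weight updates. The text does not specify the adjustment rule, so the class `PodRouter`, the proportional update in `rebalance`, and the `step` parameter below are assumptions made purely for illustration.

```python
import random

class PodRouter:
    def __init__(self, num_pods, seed=0):
        self.weights = [1.0 / num_pods] * num_pods
        self.rng = random.Random(seed)

    def route(self):
        """Pick one pod for the next document, in proportion to current weights."""
        return self.rng.choices(range(len(self.weights)), weights=self.weights)[0]

    def rebalance(self, loads, step=0.5):
        """loads: recent load per pod (e.g. queue depth). Pods above the mean
        load lose share, pods below it gain share."""
        mean = sum(loads) / len(loads)
        if mean == 0:
            return
        adjusted = [max(w * (1 - step * (load - mean) / mean), 1e-6)
                    for w, load in zip(self.weights, loads)]
        total = sum(adjusted)
        self.weights = [a / total for a in adjusted]

router = PodRouter(3)
router.rebalance([90, 60, 30])   # pod 0 is hot, pod 2 is lightly loaded
```

Because each document is still sent to exactly one pod, changing the weights alters only the split of the document stream and requires no coordination between pods.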


3.3.2 Filter set partitioning 


Our second strategy, which we call filter set partition-
ing, is used to address the memory and CPU limitations
of an individual node within a pod. Filter set partition- 
ing is analogous to sharding, declustering, and horizontal 
partitioning. Since indexing operations are memory in- 
tensive, any given node can only index a bounded num- 
ber of filters. Thus, as we scale up the number of filters 
in the system, we are forced to partition filters across the 
nodes within a pod. 

Our system supports complex filters composed of a 
conjunction of predicate expressions. In principle, we 
could decompose filters into predicates, and partition 
predicates across nodes. In practice, our implementation 
uses the simpler approach of partitioning entire filters. 
As such, a document that arrives at a node can be fully
evaluated against each filter on that node without requir-
ing any cross-node interactions.

Figure 5: Node throughput. This (sorted) graph shows
the maximum throughput each node within a pod is ca-
pable of sustaining under two different policies: random
filter placement, and alphabetic filter placement.

Since a document must be evaluated against all fil- 
ters known by the system, each document arriving at a 
pod must be transmitted to and evaluated by each node 
within the pod. Because of this, the document through- 
put that a pod can sustain is limited by the throughput of 
the slowest node within the pod. 

Two issues substantially affect node and pod through- 
put. First, a filter partitioning policy that is aware of the 
indexing algorithms used by nodes can tune the place- 
ment of filters to drive up the efficiency and throughput 
of all nodes. Second, some filters are more expensive to 
process than others. Particularly expensive filters can in- 
duce load imbalances across nodes, driving down overall 
pod throughput. 

Figure 5 illustrates these effects. Using the same 
Wikipedia workload as before, this graph illustrates the 
maximum document throughput that each node within a 
pod of 24 machines is capable of sustaining, under two 
different filter set partitioning policies. The first policy, 
random, randomly places each filter on a node, while the 
second policy, alphabetic, sorts the substring filters al- 
phabetically by their most selective ngram relaxation. By 
sorting alphabetically, the second policy causes ngrams 
that share prefixes to end up on the same node, improv- 
ing both the memory and computation efficiency of the 
Aho-Corasick index. The random policy achieves good 
load balancing but suffers from lower average through- 
put than alphabetic. Alphabetic exhibits higher average 
throughput but suffers from load imbalance. From our 
measurements using the Wikipedia filter set, a 5 million 
filter index using random placement requires 13.9GB of 
memory, while a 5 million filter index using alphabetic 
placement requires 12.1GB, a reduction of 13%. 
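
The memory effect of alphabetic placement can be illustrated by counting trie nodes. In the sketch below, the number of distinct prefixes is used as a proxy for the size of a node's trie; the function names and the toy ngrams are assumptions, and the real Aho-Corasick index carries additional per-state data.

```python
import random

def trie_size(strings):
    """Number of trie nodes needed to store the strings (distinct non-empty prefixes)."""
    return len({s[:i] for s in strings for i in range(1, len(s) + 1)})

def alphabetic_placement(grams, num_nodes):
    """Sort ngrams and give each node a contiguous range, so shared prefixes co-locate."""
    ordered = sorted(grams)
    per_node = -(-len(ordered) // num_nodes)  # ceiling division
    return [ordered[i:i + per_node] for i in range(0, len(ordered), per_node)]

def random_placement(grams, num_nodes, seed=0):
    """Shuffle ngrams and deal them out to nodes."""
    shuffled = list(grams)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_nodes] for i in range(num_nodes)]

grams = ["the cat", "the car", "the cab", "a dog", "a doe", "a dot"]
alphabetic_total = sum(trie_size(node) for node in alphabetic_placement(grams, 2))
random_total = sum(trie_size(node) for node in random_placement(grams, 2))

# Reduction reported in the text: 13.9GB (random) versus 12.1GB (alphabetic).
reported_reduction = (13.9 - 12.1) / 13.9
```

In the toy example each prefix family lands on one node under alphabetic placement, so its shared prefix is stored once; any placement that splits a family across nodes stores that prefix on both.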

Figure 6: Staged evaluations vs. throughput. This
graph shows the effect of increasing partial hit rate of
a filter set within a node on the maximum document
throughput that node is capable of processing.

In Figure 6, we measure the relationship between the
number of naive evaluations that must be executed per
document when using staged relaxation and the throughput
a node can sustain. As the number of naive executions
increases, throughput begins to drop, until eventually
the node spends most of its time performing these
evaluations. In practice, the number of naive evaluations
can increase for two reasons. First, a given relaxation can
be shared by many independent filters. If the relaxation
matches, all associated filters must be executed fully.
Second, a given relaxation might match a larger-than-usual
fraction of incoming documents. In Section 4.2,
we further quantify these two effects and propose strategies
for mitigating them.
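
The cost described above can be made concrete with a small sketch of
staged evaluation. This is an illustration under simplifying
assumptions, not the prototype's implementation: filters are plain
substrings, the names build_index and match are hypothetical, and a
linear scan over relaxations stands in for the Aho-Corasick index.

```python
from collections import defaultdict

def build_index(filters, relax):
    """Map each relaxation to the filters that share it."""
    index = defaultdict(list)
    for f in filters:
        index[relax(f)].append(f)
    return index

def match(document, index):
    """Stage 1 tests each relaxation; stage 2 naively runs the
    full filters behind every relaxation that hit."""
    hits, naive_evaluations = [], 0
    for relaxation, filters in index.items():
        if relaxation in document:        # partial hit
            for f in filters:
                naive_evaluations += 1
                if f in document:         # full (naive) evaluation
                    hits.append(f)
    return hits, naive_evaluations

filters = [
    "median income for a household",
    "median income for a family",
    "extensible crawler",
]
index = build_index(filters, relax=lambda f: f[:13])
doc = "the median income for a family in the city"
print(match(doc, index))
# (['median income for a family'], 2)
```

One partial hit on the shared relaxation costs two naive evaluations
here; the cost grows with both the number of filters sharing a
relaxation and the fraction of documents that the relaxation matches.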

Given all of these issues, our implementation takes 
the following strategy to partition filters. Initially, 
an index-aware partitioning is chosen; for example, 
alphabetically-sorted prefix grouping is used for sub- 
string filters. Filters are packed onto nodes until mem- 
ory is exhausted. Over time, the load of nodes within 
each pod is monitored. If load imbalances appear within 
a pod, then groups of filters are moved from slow nodes 
to faster nodes to rectify the imbalance. Note that mov- 
ing filters from one node to another requires the indexes 
on both nodes to be recomputed. The expense of doing 
this bounds how often we can afford to rebalance. 
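
The rebalancing step might look like the following greedy sketch. It
is only one plausible reading of the strategy: the per-group cost
measure, the tolerance threshold, and the max_moves cap (a stand-in
for the cost of recomputing indexes) are all assumptions.

```python
def rebalance(nodes, tolerance=0.2, max_moves=10):
    """nodes maps node name -> {filter group name: measured cost}.
    Moves groups from the most loaded node to the least loaded one
    until the load gap is within `tolerance` of the heaviest load."""
    moves = []
    for _ in range(max_moves):
        load = {n: sum(groups.values()) for n, groups in nodes.items()}
        slow = max(load, key=load.get)
        fast = min(load, key=load.get)
        gap = load[slow] - load[fast]
        if load[slow] == 0 or gap <= tolerance * load[slow]:
            break
        # A move only helps if the group costs less than the gap.
        candidates = [g for g, cost in nodes[slow].items() if cost < gap]
        if not candidates:
            break
        group = max(candidates, key=nodes[slow].get)
        nodes[fast][group] = nodes[slow].pop(group)
        moves.append((group, slow, fast))
    return moves

pod = {"n1": {"a": 5, "b": 4, "c": 1}, "n2": {"d": 2}}
print(rebalance(pod))   # [('a', 'n1', 'n2')]
```

Each move implies an index rebuild on both nodes, which is why the
sketch caps the number of moves per invocation.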

A final consideration is that the filter set of an exten- 
sible crawler changes over time as new filters are added 
and existing filters are modified or removed. We cur- 
rently take a simple approach to dealing with filter set 
changes: newly added or modified filters are accumulated
in “overflow” nodes within each pod and are initially
executed naively without the benefit of indexing.
We then take a generational approach to re-indexing and 
re-partitioning filters: newly added and modified filters 
that appear to be stable are periodically incorporated into 
the non-overflow nodes. 


3.4 Additional implementation details 


The data path of the extensible crawler starts as doc- 
uments are dispatched from a web crawler into our sys- 
tem and ends as matching documents are collected from 
workers and transmitted to client crawler applications 
(see Figure 1). Our current prototype does not fully explore
the design and implementation issues of either the
dispatching or collection components.

In our experiments, we use the open source Nutch spider
to crawl the web, but we modified it to store documents
locally within each crawler node’s filesystem rather
than storing them within a distributed Hadoop filesys- 
tem. We implemented a parallel dispatcher that runs on 
each crawler node. Each dispatcher process partitions 
documents across pods, replicates documents across pod 
nodes, and uses backpressure from nodes to decide the 
rate at which documents are sent to each pod. Each pod 
node keeps local statistics about filter matching rates, 
annotates matching documents with a list of filters that 
matched, and forwards matching documents to one of a 
static set of collection nodes. 

An interesting configuration problem concerns bal- 
ancing the CPU, memory, and network capacities of 
nodes within the system. We ensure that all nodes within 
a pod are homogeneous. As well, we have provisioned 
each node to ensure that the network capacity of nodes 
is not a system bottleneck. Doing so required provi- 
sioning each filter processing node with two 1-gigabit 
NICs. To take advantage of multiple cores, our filter pro- 
cessing nodes use two threads per core to process doc- 
uments against indexes concurrently. As well, we use 
one thread per NIC to pull documents from the network 
and place them in a queue to be dispatched to filter pro- 
cessing threads. We can add additional memory to each 
node until the cost of additional memory becomes pro- 
hibitive. Currently, our filter processing nodes have 3GB 
of RAM, allowing each of them to store approximately a 
half-million filters. 
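
The thread layout described above (reader threads feeding a queue
that filter processing threads drain) can be sketched as follows.
This shows only the structure: the prototype runs on a JVM, the
function name and parameters here are hypothetical, and the bounded
queue is an assumed way of slowing the reader when workers fall
behind.

```python
import queue
import threading

def run_pipeline(documents, match, num_workers=4, queue_size=8):
    """One reader thread fills a bounded queue; worker threads drain it.
    The bounded queue blocks the reader when workers fall behind."""
    pending = queue.Queue(maxsize=queue_size)
    results, lock = [], threading.Lock()

    def reader():
        for doc in documents:
            pending.put(doc)          # blocks when the queue is full
        for _ in range(num_workers):
            pending.put(None)         # one stop marker per worker

    def worker():
        while True:
            doc = pending.get()
            if doc is None:
                return
            if match(doc):
                with lock:
                    results.append(doc)

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

docs = ["a x", "b", "c x"]
print(sorted(run_pipeline(docs, lambda d: "x" in d)))   # ['a x', 'c x']
```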

Within the extensible crawler itself, all data flows 
through memory; no disk operations are required. Most 
memory is dedicated to filter index structures, but some 
memory is used to queue documents for processing and 
to store temporary data generated when matching a doc- 
ument against an index or an individual filter. 

We have not yet explored fault tolerance issues. Our 
prototype currently ignores individual node failures and 
does not attempt to detect or recover from network or 
switch failures. If a filter node fails in our current im- 
plementation, documents arriving at the associated pod 
will fail to be matched against filters that resided on that 
node. Note that our overall application semantics are best 
effort: we do not (yet) make any guarantees to client ap- 
plications about when any specific web page is crawled. 
We anticipate that this will simplify fault tolerance issues,
since it is difficult for clients to distinguish between
failures in our system and the case that a page has not yet 
been crawled. Adding fault tolerance and strengthening 
our service guarantees is a potentially challenging future 
engineering topic, but we do not anticipate needing to 
invent fundamentally novel mechanisms. 


3.5 Future considerations 


There are several interesting design and implementa- 
tion avenues for the extensible crawler. Though they are 
beyond the scope of this paper, it is worth briefly men- 
tioning a few of them. Our system currently only indexes 
textual documents; in the future, it would be interesting 
to consider the impact of richer media types (such as images,
videos, or Flash content) on the design of the filter
language and on our indexing and execution strategy. We 
currently consider the crawler itself to be a black box, but 
given that clients already specify content of interest to 
them, it might be beneficial to allow clients to focus the 
crawler on certain areas of the Web of particular inter- 
est. Finally, we could imagine integrating other streams 
of information into our system besides documents gathered
from a Web crawler, such as real-time “firehoses”
produced by systems such as Twitter. 


4 Evaluation 


In this section, we describe experiments that explore
the performance of the extensible crawler and investigate
the effect of different filter partitioning policies. As
well, we demonstrate the need to identify and reject
non-selective filters. Finally, we present early experience
with three prototype Web crawler applications.


All of our experiments are run on a cluster of 8-core, 
2GHz Intel Xeon machines with 3GB of RAM, dual 
gigabit NICs, and a 500 GB 7200-RPM Barracuda ES 
SATA hard drive. Our systems are configured to run 
32-bit Linux kernel version 2.6.22.9-91.fc7, and to use
Sun’s 32-bit JVM version 1.6.0_12 in server mode. Unless
stated otherwise, the filter workload for our experiments
consists of 9,204,600 unique sentences extracted
from Wikipedia; experiments with relaxation and stag- 
ing used 32 byte prefix ngrams extracted from the filter 
sentences. 


For our performance oriented experiments, we gath- 
ered a 3,349,044 Web document crawl set on August 
24th, 2008 using the Nutch crawler and pages from the 
DMOZ open directory project as our crawl seed. So that 
our experiments were repeatable, when testing the performance
of the extensible crawler we used a custom
tool to stream this document set at high throughput,
rather than re-crawling the Web. Of the 3,349,044 docu- 
ments in our crawl set, 2,682,590 contained textual con- 
tent, including HTML and PDF files; the rest contain bi- 
nary content, including images and executables. Our ex- 
tensible crawler prototype does not yet notify wide-area 
clients about matching documents; instead, we gather 
statistics about document matches, but drop the matching
documents instead of transmitting them.

Figure 7: Lucene vs. extensible crawler. This graph
compares the document processing rates of a single-node 
extensible crawler and a single-node Lucene search en- 
gine. The x-axis displays the number of documents 
crawled between reconstructions of the Lucene index. 
Note that the y-axis is logarithmically scaled. 


4.1 Nutch vs. the extensible crawler 


In Section 2.1, we described architectural, workload, 
and expected performance differences between the ex- 
tensible crawler and an alternative implementation of 
the service based on a conventional search engine. To 
demonstrate these differences quantitatively, we ran a se- 
ries of experiments directly comparing the performance 
of our prototype to an alternative implementation based 
on the Lucene search engine, version 2.1.0 [7]. 

The construction of a Lucene-based search index 
is typically performed as part of a Nutch map-reduce 
pipeline that crawls web pages, stores them in the HDFS 
distributed filesystem, builds and stores an index in 
HDFS, and then services queries by reading index entries
from HDFS. To make the comparison of Lucene to
our prototype more fair, we eliminated overheads intro- 
duced by HDFS and map-reduce by modifying the sys- 
tem to store crawled pages and indexes in nodes’ local 
filesystems. Similarly, to eliminate variation introduced 
by the wide-area Internet, we spooled our pre-crawled 
Web page data set to Lucene’s indexer or to the extensi- 
ble crawler over the network. 

The search engine implementation works by periodically
constructing an index based on the N most recently
crawled web pages; after constructing the index and par- 
titioning it across nodes, each node evaluates the full fil- 
ter set against its index fragment. The implementation 
uses one thread per core to evaluate filters. By increasing 
N, the implementation indexes less frequently, reducing 
overhead, but suffers from a larger latency between the 
downloading of a page by the crawler and the evaluation 
of the filter set against that page. In contrast, the extensible
crawler implementation constructs an index over
its filters once, and then continuously evaluates pages 
against that index. 

In Figure 7, we compare the single node throughput
of the two implementations, with 400,000 filters from the
Wikipedia workload. Note that the x-axis corresponds to
N, but this parameter only applies to the Lucene crawler.
The extensible crawler implementation has nearly two
orders of magnitude better performance than Lucene;
this is primarily due to the fact that Lucene must service
queries against its disk-based document index, while the
extensible crawler’s filter index is served out of memory.
As well, the Lucene implementation is only able to
achieve asymptotic performance if it indexes batches of
N > 1,000,000 documents.

Figure 8: Collision set size. This histogram shows the
distribution of collision set sizes when using 32-byte
ngrams over the Wikipedia filter set.

Our head-to-head comparison is admittedly still un- 
fair, since Lucene was not optimized for fast, incremen- 
tal, memory-based indexing. Also, we could conceivably 
bridge the gap between the two implementations by us- 
ing SSD drives instead of spinning platters to store and 
serve Lucene indexes. However, our comparison serves 
to demonstrate some of the design tensions between con- 
ventional search engines and extensible crawlers. 


4.2 Filter partitioning and blacklisting 


As mentioned in Section 3.3.2, two different aspects 
of a filter set contribute to load imbalances between oth- 
erwise identical machines: first, a specific relaxation 
might be shared by many different filters, causing a par- 
tial hit to result in commensurately many naive filter ex- 
ecutions, and second, a given relaxation might match a 
large number of documents, also causing a large number 
of naive filter executions. We now quantify these effects. 

Figure 9: Node throughput. This (sorted) graph shows
the maximum throughput each node within a pod is capable
of sustaining under three different policies: random
filter placement, alphabetic filter placement, and alphabetic
filter placement with blacklisting.

We call a set of filters that share an identical relaxation
a collision set. A collision set of size 1 implies the
associated filter’s relaxation is unique, while a collision
set of size N implies that N filters share a specific relaxation.
In Figure 8, we show the distribution of collision
set sizes when using a 32-byte prefix relaxation of the
Wikipedia filter set. The majority of filters (8,434,126
out of 9,204,600) have a unique relaxation, but some
relaxations collide with many filters. For example, the
largest collision set size was 35,585 filters. These filters
all shared the prefix “The median income for a household
in the”; this sentence is used in many Wikipedia
articles describing income in cities, counties, and other
population centers. If a document matches against this
relaxation, the extensible crawler would need to naively
execute all 35,585 filters.
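
Collision set sizes of this kind can be computed with a single pass
over the filter set. The sketch below assumes UTF-8 encoded sentence
filters and a plain 32-byte prefix as the relaxation; the function
name and the three example sentences are made up for illustration.

```python
from collections import Counter

PREFIX_BYTES = 32

def collision_set_sizes(filters, prefix_bytes=PREFIX_BYTES):
    """Count how many filters share each prefix relaxation."""
    return Counter(f.encode("utf-8")[:prefix_bytes] for f in filters)

filters = [
    "The median income for a household in the city was $40,000.",
    "The median income for a household in the county was $52,000.",
    "The extensible crawler indexes filters rather than documents.",
]
sizes = collision_set_sizes(filters)
print(sorted(sizes.values(), reverse=True))   # [2, 1]
```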

In a similar vein, some filter relaxations
will match a larger-than-usual fraction of documents.
One notably egregious example from our Web malware
detection application was a filter whose relaxation
was the 32-character sequence <meta
http-equiv="Content-Type">. Unsurprisingly, a
very large fraction of Web pages contain this substring!

To deal with these two sources of load imbalance, our 
implementation blacklists specific relaxations. If our im- 
plementation notices a collision set containing more than 
100 filters, we blacklist the associated relaxation, and 
compute alternate relaxations for those filters. In the case 
of our Wikipedia filter set, this required modifying the re- 
laxation of only 145 filters. As well, if our implementa- 
tion notices that a particular relaxation has an abnormally 
high document partial hit rate, that “hot” relaxation is 
blacklisted and new filter relaxations are chosen. 
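
A minimal sketch of the blacklisting loop follows. It makes several
assumptions that the text does not specify: candidate relaxations are
all n-character substrings of a filter, the first candidate that is
not blacklisted is chosen (rather than the most selective one), and
“hot” relaxations are supplied by the caller instead of being
measured. All names are hypothetical.

```python
from collections import defaultdict

def candidates(text, n):
    """All n-character substrings of a filter, in order."""
    return [text[i:i + n] for i in range(max(1, len(text) - n + 1))]

def choose_relaxations(filters, n=32, max_collisions=100, hot=(), max_rounds=10):
    """Pick a relaxation per filter, blacklisting any relaxation whose
    collision set exceeds max_collisions and then choosing again."""
    blacklist = set(hot)
    chosen = {}
    for _ in range(max_rounds):
        groups = defaultdict(list)
        for f in filters:
            options = [c for c in candidates(f, n) if c not in blacklist]
            # Fall back to the whole filter if every ngram is blacklisted.
            chosen[f] = options[0] if options else f
            groups[chosen[f]].append(f)
        crowded = {r for r, fs in groups.items() if len(fs) > max_collisions}
        if not crowded:
            break
        blacklist |= crowded
    return chosen, blacklist

chosen, blacklist = choose_relaxations(
    ["abcdX", "abcdY", "wxyz"], n=4, max_collisions=1)
print(chosen)      # {'abcdX': 'bcdX', 'abcdY': 'bcdY', 'wxyz': 'wxyz'}
print(blacklist)   # {'abcd'}
```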

Blacklisting rectifies these two sources of load imbalance.
Figure 9 revisits the experiment previously illustrated
in Figure 5, but with a new line that shows
the effect of blacklisting on the distribution of document 
processing throughputs across the nodes in our cluster. 
Without blacklisting, alphabetic filter placement demon- 
strates significant imbalance. With blacklisting, the ma- 
jority of the load imbalance is removed, and the slowest 
node is only 17% slower than the fastest node. 


4.3 Experiences with Web crawler applications 


To date, we have prototyped three Web crawler applications:
vanity alerts that detect pages containing
a user’s name, a malware detection application that
finds Web objects that match ClamAV’s malware signature
database, and a copyright violation detection service
that looks for pages containing copies of Reuters or
Wikipedia articles. Table 3 summarizes the high-level
features of these applications and their filter workloads;
we now discuss each in turn, relating additional details
and anecdotes.

Table 3: Web crawler application features and performance.
This table summarizes the high-level workload
and performance features of our three prototype Web
crawler applications.


4.3.1 Vanity alerts 


For our vanity filter application, we authored 10,622 
filters based on names of university faculty and students. 
Our filters were constructed as regular expressions of 
the form “first.{1,20}last”, i.e., the user’s first
name followed by their last name, with the constraint of 
no more than 20 characters separating the two parts. The 
filters were first relaxed into a conjunct of substring 32- 
grams, and from there the longest substring conjunct was 
selected as the final relaxation of the filter. 
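
To make the filter form concrete, the sketch below builds such a
regular expression with Python's re module. The name “Jane Park” and
the sample sentences are invented; the second sentence shows the kind
of false hit that arises when a surname is contained in a common word.

```python
import re

def vanity_filter(first, last):
    """Build a regex of the form first.{1,20}last."""
    return re.compile(re.escape(first) + r".{1,20}" + re.escape(last))

f = vanity_filter("Jane", "Park")
print(bool(f.search("Prof. Jane Q. Park joined the faculty.")))  # True
print(bool(f.search("Jane left it in the Parking lot.")))        # True (false hit)
print(bool(f.search("Park, Jane")))                              # False
```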

This filter set matched against 13.1% of documents 
crawled. This application had a modest number of fil- 
ters, but its filter set nonetheless matched against a large 
fraction of Web pages, violating our assumption that a 
crawler application should have highly selective filters. 
Moreover, when using relaxed filters without staging, there were
many additional false partial hits (an average of 15.76 
per document, and an overall document hit rate of 69%). 
Most false hits were due to names that are contained in 
commonly found words, such as Tran, Chang, or Park. 

The lack of high selectivity of its filter set leads us 
to conclude that this application is not a good candidate 
for the extensible crawler. If millions of users were to 
use this service, most Web pages would likely match and 
need to be delivered to the crawler application. 


4.3.2 Copyright violation detection 


For our second application, we prototyped a copyright 
violation detection service. We evaluated this application 
by constructing a set of 251,647 filters based on 30,534 
AP and Reuters news articles appearing between July 
and October of 2008. Each filter was a single sentence 
extracted from an article, but we extracted multiple filters
from each article. We evaluated the resulting filter
set against a crawl of 3.68 million pages. 

Overall, 590 crawled documents (0.016%) matched 
against the AP/Reuters filter set, and 619 filters (0.028% 
of the filter set) were responsible for these matches. We 
manually determined that most matching documents
were original news articles or blogger pages that
quoted sections of articles with attribution. We did find 
some sites that appeared to contain unauthorized copies 
of entire news stories, and some sites that plagiarized 
news stories by integrating the story body but replacing 
the author’s byline. 

If a document hit, it tended to hit against a single fil- 
ter (50% of document hits were for a single filter). A 
smaller number of documents hit against many sentences 
(13% of documents matched against more than 6 filters). 
Documents that matched against many filters tended to 
contain full copies of the original news story, while documents
that matched a single sentence tended to contain
boilerplate prose, such as a specific author’s byline, legal 
disclosures, or common phrases such as “The officials 
spoke on condition of anonymity because they weren’t 
authorized to release the information.” 


4.3.3 Web malware detection


The final application we prototyped was Web mal- 
ware detection. We extracted 3,128 text-centric regular 
expressions from the February 20, 2009 release of the 
ClamAV open-source malware signature database. Be- 
cause many of these signatures were designed to match 
malicious JavaScript or HTML, some of their relaxations
contain commonly occurring substrings, such as
“<a href="http://”. As a result, blacklisting was
a particularly important optimization for this workload; 
the system was always successful at finding suitably se- 
lective relaxations of each filter. 

Overall, this filter set matched 342 pages from the 
same crawl of 3.68 million pages, with an overall rate of 
0.009%. The majority of hits (229) were for two similar 
signatures that capture obfuscated JavaScript code that 
emits an iframe in the parent page. We examined all of 
the pages that matched these signatures; in each case, the
iframe contained links to other pages that are known to 
contain malicious scripts. Most of the matching pages 
appeared to be legitimate business sites that had been 
compromised. We also found several pages that matched 
a ClamAV signature designed to detect Web bugs. 

In addition to genuinely malicious Web pages, we 
found a handful of pages that were application-level false 
positives, i.e., they correctly matched a ClamAV filter,
but the page did not contain the intended attack. Some 
of these application-level false positives contained blog 
entries discussing virulent spam, and the virulent spam 
itself was represented in the ClamAV database.


5 Related work


The extensible crawler is related to several classes 
of systems: Web crawlers and search engines, publish- 
subscribe systems, packet filtering engines, parallel and 
streaming databases, and scalable Internet content syndi- 
cation protocols. We discuss each in turn. 

Web crawlers and search engines. The engineer- 
ing issues of high-throughput Web crawlers are complex 
but well understood [21]. Modern Web crawlers can re- 
trieve thousands of Web pages per second per machine. 
Our work leverages existing crawlers, treating them as a 
black box from which we obtain a high throughput doc- 
ument stream. The Mercator project explored the design 
of an extensible crawler [20], though Mercator’s notion 
of extensibility is different than ours: Mercator has well- 
defined APIs that simplify the job of adding new modules 
that extend the crawler’s set of network protocols or type- 
specific document processors. Our extensible crawler 
permits remote third parties to dynamically insert new 
filters into the crawling pipeline. 

Modern Web search engines require complex engi- 
neering, but the basic architecture of a scalable search 
engine has been understood for more than a decade [8].
Our extensible crawler is similar to a search engine, but 
inverted, in that we index queries rather than documents. 
As well, we focus on in-memory indexing for through- 
put. Cho and Rajagopalan described a technique for sup- 
porting fast indexing of regular expressions by reducing 
them to ngrams [12]; our notion of filter relaxation is a 
generalization of their approach. 

Though the service is now discontinued, Ama- 
zon.com offered programmatic search access to a 300TB 
archive containing 4 billion pages crawled by Alexa Internet
[5] and updated daily. By default, access was restricted
to queries over a fixed set of search fields; however,
customers could pay to re-index the full data set
over custom fields. In contrast, the extensible crawler 
permits customers to write custom filters over any at- 
tribute supported by our document extractors, and since 
we index filters rather than pages, our filters are evalu- 
ated in real-time, at the moment a page is crawled. 

Publish-subscribe systems. The extensible crawler 
can be thought of as a content-based publish-subscribe 
system [22] designed and optimized for a real-time Web 
crawling workload. Content-based pub-sub systems have 
been explored in depth, including in the Gryphon [2],
Siena [9], Elvin [31], and Le Subscribe [16, 27] projects. 
Many of these projects explore the trade-off between fil- 
ter expressiveness and evaluation efficiency, though most 
have a wide-area, distributed event notification context in 
mind. Le Subscribe is perhaps closest to our own system; 
their language is also a conjunction of predicates, and 
like us, they index predicates in main memory for scalable,
efficient evaluation. In contrast to these previous
projects, our work explores in depth the partitioning of
documents and filters across machines, the suitability of
our expression language for Web crawling applications,
and the impact of disproportionately high hit rate filters,
and evaluates several prototype applications.

Web-based syndication protocols, such as RSS and 
Atom, permit Web clients to poll servers to receive feeds 
of new articles or document elements. Cloud-based ag- 
gregation and push notification services such as rssCloud 
and PubSubHubbub allow clients to register interest in 
feeds and receive notifications when updates occur, re- 
lieving servers from pull-induced overload. These ser- 
vices are roughly equivalent to channel-based pub-sub 
systems, whereas the extensible crawler is more equiva- 
lent to a content-based system. 

The Google alerts system [18] allows users to specify 
standing search queries to be evaluated against Google’s 
search index. Google alerts periodically emails users 
newly discovered search results relevant to their queries. 
Alerts uses two different approaches to gather new re- 
sults: it periodically re-executes queries against the 
search engine and filters previously returned results, and 
it continually matches incoming documents against the 
body of standing user queries. This second approach has 
similarities to the extensible crawler, though details of 
Google alerts’ architecture, workload, performance, and
scalability have not been publicly disclosed, preventing 
an in-depth technical comparison. 

Cobra [29] is perhaps most similar to our system. Cobra
is a distributed system that crawls RSS feeds, evaluates
articles against user-supplied filters, and uses reflectors
to distribute matching articles to interested users. Both
Cobra and the extensible crawler benefit from a filter language
designed to facilitate indexing. Cobra focused on
issues of distribution, provisioning, and network-aware 
clustering, whereas our work focuses on a single-cluster 
implementation, efficiency through filter relaxation and 
staging, and scalability through document and filter set 
partitioning. As well, Cobra was oriented towards scal- 
able search and aggregation of Web feeds, whereas the 
extensible crawler provides a platform for more widely 
varied crawling applications, such as malware and copy- 
right violation detection. 

Packet filters and NIDS. Packet filters and network
intrusion detection systems (NIDS) face challenges similar
to those of the extensible crawler: both classes of systems
must process a large number of filters over a high band- 
width stream of unstructured data with low latency. The 
BSD packet filter allowed control-flow graph filters to be 
compiled down to an abstract filtering machine, and executed
safely and efficiently in an OS kernel [25]. Packet
filtering systems have also confronted the problem of efficiently
supporting more expressive filters, while preventing
state space explosion when representing large filter
sets as DFAs or NFAs [23, 32]. Like an extensible
crawler, packet filtering systems suffer from the problem
of normalizing document content before matching
against filters [30], and of providing additional execution
context so that byte-stream filters can take advantage of 
higher-level semantic information [34]. Our system can 
benefit from the many recent advances in this class of 
system, though our applications require orders of mag- 
nitude more filters and therefore a more scalable imple- 
mentation. As well, our application domain is more ro- 
bust against false positives. 

Databases and SDIs. The extensible crawler shares 
some design considerations, optimizations, and imple- 
mentation techniques with parallel databases such as 
Bubba [13] and Gamma [14], in particular our need to 
partition filters (queries) and documents (records) over 
machines, and our focus on high selectivity as a path 
to efficiency. Our workload tends to require many more 
concurrent filters, but does not provide the same expres- 
siveness as SQL queries. We also have commonalities 
with streaming database systems and continuous query 
processors [1, 10, 11], in that both systems execute stand- 
ing queries against an infinite stream of data. However, 
streaming database systems tend to focus on semantic issues
of queries over limited time windows, particularly
when considering joins and aggregation queries, while 
we focus on scalability and Web crawling applications. 

Many databases support the notion of triggers that fire 
when matching records are added to the database. Prior 
work has examined indexing techniques for efficiently 
supporting a large number of such triggers [19]. 

Selective Dissemination of Information (SDI) sys- 
tems [35], including those that provide scalable, efficient 
filtering of XML documents [4], share our goal of executing
a large number of filters over semi-structured documents,
and rely on the same insight of indexing queries
to match against individual documents. These systems 
tend to have more complex indexing schemes, but have 
not yet been targeted at the scale, throughput, or applica- 
tion domain of the extensible crawler. 


6 Conclusions 


This paper described the design, prototype implemen- 
tation, and evaluation of the extensible crawler, a service 
that crawls the Web on behalf of its many client applica- 
tions. Clients extend the crawler by injecting filters that 
identify pages of interest to them. The crawler continu- 
ously fetches a stream of pages from the Web, simultane- 
ously executes all clients’ filters against that stream, and 
returns to each client those pages selected by its filter set. 

An extensible crawler provides several benefits. It relieves
clients of the need to operate and manage their
own private crawler, greatly reducing a client’s bandwidth
and computational needs when locating pages of
interest. It is efficient in terms of Internet resources: a 
crawler queries a single stream of Web pages on behalf 
of many clients. It also has the potential for crawling 
highly dynamic Web pages or real-time sources of infor- 
mation, notifying clients quickly when new or interesting 
content appears. 

The evaluation of XCrawler, our early prototype sys- 
tem, focused on scaling issues with respect to its num- 
ber of filters and crawl rate. Using techniques from re- 
lated work, we showed how we can support rich, expres- 
sive filters using relaxation and staging techniques. As 
well, we used microbenchmarks and experiments with 
application workloads to quantify the impact of load bal- 
ancing policies and confirm the practicality of our ideas. 
Overall, we believe that the low-latency, high selectivity, 
and scalable nature of our system makes it a promising 
platform for many applications.


Abstract


Byzantine fault-tolerant (BFT) replication has enjoyed a 
series of performance improvements, but remains costly 
due to its replicated work. We eliminate this cost for 
read-mostly workloads through Prophecy, a system that 
interposes itself between clients and any replicated ser- 
vice. At Prophecy’s core is a trusted sketcher compo- 
nent, designed to extend the semi-trusted load balancer 
that mediates access to an Internet service. The sketcher 
performs fast, load-balanced reads when results are his- 
torically consistent, and slow, replicated reads otherwise. 
Despite its simplicity, Prophecy provides a new form of 
consistency called delay-once consistency. Along the 
way, we derive a distributed variant of Prophecy that 
achieves the same consistency but without any trusted 
components. 

A prototype implementation demonstrates Prophecy’s 
high throughput compared to BFT systems. We also de- 
scribe and evaluate Prophecy’s ability to scale-out to sup- 
port large replica groups or multiple replica groups. As 
Prophecy is most effective when state updates are rare, 
we finally present a measurement study of popular web- 
sites that demonstrates a large proportion of static data. 


1 Introduction 


Replication techniques are now the norm in large-scale 
Internet services, in order to achieve both reliability and 
scalability. However, leveraging active agreement to 
mask failures, whether to handle fail-stop behavior [41, 
50] or fully malicious (Byzantine) failures [42], is not 
yet widely used. There is some movement in this direction
from industry, such as Google’s Chubby [10] and
Yahoo!’s Zookeeper [66] coordination services, based on
Paxos [41], but both are used to manage infrastructure,
not to directly mask failures in customer-facing services.
And yet non-fail-stop failures in customer-facing ser- 
vices continue to occur, much to the chagrin and concern 
of system operators. Failures may arise from malicious 
break-ins, but they also may occur simply from system 
misconfigurations: Facebook leaking source code due to 
one misconfigured server [60], or Flickr mixing up re- 
turned images due to one improper cache server [24]. 
In fact, both of these examples could have been prevented 
through redundancy and agreement, without requiring 
full N-version programming [8]. The perceived 
need for systems robust to Byzantine faults (a superset 
of misconfigurations and Heisenbugs) has spawned almost 
a cottage industry on improving performance results 
of Byzantine fault tolerant (BFT) algorithms [1, 6, 
12, 17, 30, 37, 38, 56, 62, 64, 65, 67]. 

While the latency of recent BFT algorithms has 
approached that of unreplicated reads to individual 
servers [15, 38, 64], the throughput of such systems 
falls far short. This is simple math: a minimum of four 
replicas [12] (or sometimes even six [1]) are required 
to tolerate one faulty replica, at least three of which 
must participate in each operation. For datacenters in 
the thousands or tens of thousands of servers, requiring 
four times as many servers without increasing throughput 
may be a non-starter. Even services that already replicate 
their data, such as the Google File System [25], would 
see their throughput drop significantly when using BFT 
agreement. 

But if the replication cost of BFT is provably neces- 
sary [9], something has to give. One might view our 
work as a thought experiment that explores the potential 
benefit of placing a small amount of trusted software or 
hardware in front of a replicated service. After all, wide- 
area client access to an Internet service is typically medi- 
ated by some middlebox, which is then at least trusted to 
provide access to the service. Further, a small and sim- 
ple trusted component may be less vulnerable to prob- 
lems such as misconfigurations or Heisenbugs. And by 
treating the back-end service as an abstract entity that ex- 
poses a limited interface, this simple device may be able 
to interact with both complex and varied services. Our 
implementation of such a device has fewer than 3000 lines 
of code. 

Barring such a solution, most system designers opt 
either for cheaper techniques (to avoid the costs of 
state machine replication) or more flexible techniques 
(to ensure service availability under heavy failures or 
partitions). The design philosophies of Amazon’s Dy- 
namo [18], GFS [25], and other systems [20, 23, 61] 
embrace this perspective, providing only eventually- 
consistent storage. On the other hand, the tension be- 
tween these competing goals persists, with some systems 
in industry re-introducing stronger consistency proper- 
ties. Examples include timeline consistency in Yahoo!’s 
PNUTS [16] and per-user cache invalidation on Facebook [21]. 
Nevertheless, we are unaware of any major 
use of agreement at the front-tier of customer-facing services. 
In this paper, we challenge the assumption that 
the tradeoff between strong consistency and cost in these 
services is fundamental. 

NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation 

This paper presents Prophecy, a system that lowers 
the performance overhead of fault-tolerant agreement for 
customer-facing Internet services, at the cost of slightly 
weakening its consistency guarantees. At Prophecy’s 
core is a trusted sketcher component that mediates client 
access to a service replica group. The sketcher maintains 
a compact history table of observed request/response 
pairs; this history allows it to perform fast, load-balanced 
reads when state transitions do not occur (that is, when 
the current response is identical to that seen in the past) 
and slow, replicated reads otherwise (when agreement 
is required). The sketcher is a flexible abstraction that 
can interface with any replica group, provided it exposes 
a limited set of defined functionality. This paper, how- 
ever, largely discusses Prophecy’s use with BFT replica 
groups. Our contributions include the following: 


• When used with BFT replica groups that guarantee 
linearizability [32], Prophecy significantly increases 
throughput through its use of fast, load-balanced 
reads. However, it relaxes the consistency 
properties to what we term delay-once semantics. 

• We also derive a distributed variant of Prophecy, 
called D-Prophecy, that similarly improves the 
throughput of traditional fault-tolerant systems. 
D-Prophecy achieves the same delay-once consistency 
but without any trusted components. 

• We introduce the notion of delay-once consistency 
and define it formally. Intuitively, it implies that 
faulty nodes can at worst return only stale (not arbitrary) 
data. 

• We demonstrate how to scale out Prophecy to support 
large replica groups or many replica groups. 

• We implement Prophecy and apply it to BFT replica 
groups. We evaluate its performance on realistic 
workloads, not just null workloads as typically done 
in the literature. Prophecy adds negligible latency 
compared to standard load balancing, while it provides 
an almost linear increase in throughput. 

• Prophecy is most effective in read-mostly workloads 
where state transitions are rare. We conduct 
a measurement study of the Alexa top-25 websites 
and show that over 90% of requests are for mostly 
static data. We also characterize the dynamism in 
the data. 


Table 1 summarizes the different properties of a traditional 
BFT system, D-Prophecy, and Prophecy. 

Table 1: Comparison of a traditional BFT system, D-Prophecy, and Prophecy. 

The remainder of this paper is organized as follows. In §2 we 
motivate the design of D-Prophecy and Prophecy, and we 
describe this design in §3. In §4 we define delay-once 
consistency and analyze Prophecy’s implementation of 
this consistency model. In §5 we discuss extensions to 
the basic system model that consider scale and complex 
component topologies. We detail our prototype imple- 
mentation in §6 and describe our system evaluation in 
§7. In §8 we present our measurement study. We review 
related work in §9 and conclude in §10. 


2 Motivating Prophecy’s Design 


One might rightfully ask whether Prophecy makes un- 
fair claims, given that it achieves performance and scal- 
ability gains at the cost of additional trust assumptions 
compared to traditional fault-tolerant systems. This sec- 
tion motivates our design through the lens of BFT sys- 
tems, in two steps. First, we improve the performance 
of BFT systems on realistic workloads by introducing a 
cache at each replica server. By optimizing the use of this 
cache, we derive a distributed variant of Prophecy that 
does not rely on any trusted components. Then, we ap- 
ply this design to customer-facing Internet services, and 
show that the constraints of these services are best met 
by a shared, trusted cache that proxies client access to the 
service replica group. The resulting system is Prophecy. 

In our discussion, we differentiate between write re- 
quests, or those that modify service state, and read re- 
quests, or those that simply access state. 


2.1 Traditional BFT Services and Real Workloads 


A common pitfall of BFT systems is that they are eval- 
uated on null workloads. Not only are these workloads 
unrealistic, but they also misrepresent the performance 
overheads of the system. Our evaluation in §7 shows that 
the cost of executing a non-null read request in the PBFT 
system [12] dominates the cost of agreeing on the order- 
ing of the request, even when the request is served en- 
tirely from main memory. Thus the PBFT read optimiza- 
tion, which optimistically avoids agreement on read re- 
quests, offers little or no benefit for most realistic work- 
loads. Improving the performance of read requests re- 
quires optimizing the execution of the reads themselves. 
Unlike write requests, which modify service state and 
hence must be executed at each replica server, read re- 
quests can benefit from causality tracking. For example, 
if there are no causally-dependent writes between two 
identical reads, a replica server could simply cache the 
response of the first read and avoid the second read altogether.¹ 
However, this requires (1) knowledge of the 
causal dependencies of all write requests, and (2) a re- 
sponse cache of all prior reads at each replica server. The 
first requirement is unrealistic for many applications: a 
single write may modify the service state in complex 
ways. Even if we address this problem by invalidating 
the entire response cache upon receiving any write, the 
space needed by such a cache could be prohibitive: a 
cache of Facebook’s 60+ billion images on April 30, 
2009 [49], assuming a scant 1% working-set size, would 
occupy approximately 15TB of memory. Thus, the sec- 
ond requirement is also unrealistic. 

Instead of caching each response r, the replica servers 
can store a compact, collision-resistant sketch s(r) to en- 
able cache validation. That is, when a client issues a read 
request for r, only one replica server executes the read 
and replies with r, while the remaining replica servers 
reply with s(r) from their caches. The client accepts r 
only if the replica group agrees on s(r) and if s(r) vali- 
dates r. Thus, even if the replica that returns r is faulty, it 
cannot make the client accept arbitrary data; in the worst 
case, it causes the client to accept a stale version of r. 
Therefore we only need to ask one replica to execute the 
read, effectively implementing what we call a fast read. 
Fast reads drastically improve the throughput of read re- 
quests and can be load-balanced across the replica group 
to avoid repeated stale results. The replica servers main- 
tain a fresh cache by updating it during regular (repli- 
cated) reads, which are issued when fast reads fail. Us- 
ing a compact cache reduces the memory footprint of the 
Facebook image working set to less than 27GB. 
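
The two memory figures quoted in this section can be sanity-checked with a short back-of-envelope calculation. The sketch below is illustrative only: the text does not state an average image size or a per-entry size for the compact cache, so both are back-computed here from the quoted totals (15TB and 27GB, taken as decimal units), and all variable and function names are hypothetical.

```python
# Back-of-envelope check of the cache-size figures quoted above.
# The per-item sizes are NOT given in the text; they are derived from the
# quoted totals and should be read as rough implied values only.

TOTAL_IMAGES = 60e9          # "60+ billion images" (from the text)
WORKING_SET_FRACTION = 0.01  # "a scant 1% working-set size" (from the text)
FULL_CACHE_BYTES = 15e12     # "approximately 15TB" (decimal units assumed)
SKETCH_CACHE_BYTES = 27e9    # "less than 27GB" (decimal units assumed)


def working_set_items(total_items: float, fraction: float) -> float:
    """Number of items expected to sit in the cache."""
    return total_items * fraction


def implied_bytes_per_item(total_bytes: float, items: float) -> float:
    """Average bytes per cached item implied by a total cache size."""
    return total_bytes / items


items = working_set_items(TOTAL_IMAGES, WORKING_SET_FRACTION)
bytes_per_response = implied_bytes_per_item(FULL_CACHE_BYTES, items)
bytes_per_sketch_entry = implied_bytes_per_item(SKETCH_CACHE_BYTES, items)

print(f"working set: {items:.0f} images")
print(f"implied full response size: {bytes_per_response / 1e3:.0f} KB")
print(f"implied sketch entry size: {bytes_per_sketch_entry:.0f} bytes")
```

Under these assumptions the working set is about 600 million images, which implies roughly 25KB per cached response and roughly 45 bytes per sketch entry. The latter is of the same order as the figure reported in §3.4, where 22 million entries fit in less than 1GB.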

We call the resulting system Distributed Prophecy, or 
D-Prophecy, and call the consistency semantics it pro- 
vides delay-once consistency. 


2.2 BFT Internet Services 


An oft-overlooked issue with BFT systems, including D- 
Prophecy, is that they are implicitly designed for services 
with long-running sessions between clients and replica 
servers (or at least always presented and evaluated as 
such). Clients establish symmetric session keys with 
each replica server, although the overhead of doing so 
is not typically included when calculating system performance. 
Figure 1 shows the throughput of the PBFT implementation 
as a function of session length, with all relevant 
optimizations enabled including the read optimization 
(indicated by ‘ro’). As sessions get shorter, throughput 
is drastically reduced because replicas need to decrypt 
and verify clients’ new session keys. For PBFT 
sessions consisting of 128 read requests, throughput is 
half of its maximum, and for sessions consisting of 8 read 
requests, throughput is one-tenth of its maximum. 

Figure 1: PBFT’s throughput in the thousands of requests 
per second for null requests in sessions of varying length. 
Note that both axes are log scale. 

¹Other causality-based optimizations, such as client-side speculation [64] 
or server-side concurrent execution [37], are also possible, but 
are complementary to any cache-based optimizations. 

The assumption of long-lived sessions breaks down 
for Internet services, however, which are mostly char- 
acterized by short-lived sessions and unmodified clients. 
These properties make it impractical for clients to es- 
tablish per-session keys with each replica. Moreover, 
depending on clients to perform protocol-specific tasks 
leads to poor backwards compatibility for legacy clients 
of Internet services (e.g., web browsers), where cryp- 
tographic support is not easily available [2]. Instead, 
we might turn to using an entity knowledgeable of the 
BFT protocol to proxy client requests to a service replica 
group. And since Internet services already rely on the 
correct operation of local middleboxes (at least with re- 
spect to service availability), we extend this reliance 
by converting the middlebox into a trusted proxy. The 
trusted proxy interfaces multiple short-lived sessions be- 
tween clients and itself with a single long-lived session 
between itself and the replica group, acting as a client in 
the traditional BFT sense. 

When using proxied client access to a D-Prophecy 
group, there is no need to maintain redundant caches at 
each replica server: a shared cache at the trusted proxy 
suffices, and it preserves delay-once consistency. A fast 
read now mimics the performance of an unreplicated 
read, as the proxy only asks one replica server for r and 
validates the response with its (local) copy of s(r). Since 
the cache is compact, the proxy remains a small and sim- 
ple trusted component, amenable to verification. We call 
this system Prophecy, and present its design in §3. 


2.3 Applications 


The delay-once semantics of Prophecy imply that faulty 
nodes can at worst return stale (not arbitrary) data. This 
semantics is sufficient for a variety of applications. For 
example, Prophecy would be able to protect against the 
Facebook and Flickr mishaps mentioned in the intro- 
duction, because it would not allow arbitrary data to 
reach the client. Applications that serve inherently static 
(write-once) data are also good candidates, because here 
a “stale” response is as fresh as the latest response. In §8 
we demonstrate the propensity for static data in today’s 
most popular websites. 

Social networks and “Web 2.0” applications are good 
candidates for delay-once consistency because they typ- 
ically do not require all writes to be immediately visible. 
Consider the following example from Yahoo!’s PNUTS 
system [16]. A user wants to upload spring-break pho- 
tos to an online photo-sharing site, but does not want his 
mother to see them. So, he first removes her from the per- 
mitted access list of his database record and then adds the 
spring-break photos to this record. A consistency model 
that allows these updates to appear in different orders at 
different replicas, such as eventual consistency [22], is 
insufficient: it violates the user’s intention of hiding the 
photos from his mother. Delay-once consistency only 
allows stale data to be returned, not data out-of-order: 
if the photos are visible, then the access control update 
must have already taken place. Further, once the user 
has “refreshed” his own page and sees the photos, he is 
guaranteed that his friends will also see them. 

For applications where writes are critical, such as a 
bank account, delay-once consistency is appropriate be- 
cause it ensures that writes follow the protocol of the 
replica group. Although reads may return stale results, 
they can only do so in a limited way, as we discuss in 
§4. Prophecy limits the duration of staleness in practice 
using load balancing. On the other hand, there are some 
applications for which delay-once consistency is not ben- 
eficial, such as those that critically depend on reading the 
latest data (e.g., a rail signaling service), or those that re- 
turn non-deterministic content (e.g., a CAPTCHA gener- 
ator). 


3 System Design 


We first define a sketcher abstraction that lies at the heart 
of Prophecy. For a more traditional setting, we use this 
sketcher to design a distributed variant of Prophecy, or 
D-Prophecy. We then present the design of Prophecy. 


3.1 The Sketcher 


Prophecy and D-Prophecy use a sketcher to improve 
the performance of read requests to an existing replica 
group. A sketcher maintains a history table of compact, 
collision-resistant sketches of requests and responses 
processed by a replica group. Each entry in the history 
table is of the form (s(q), s(r)), where q is a request, r 
is the response to q, and s is the sketching function used 
for compactness (s typically makes use of a secure hash 
function like SHA-1). The sketcher computes sketches 
and looks up or updates entries in the history table using 
a standard get/set interface, keyed by s(q). In Prophecy, 
only read requests and responses are stored in the history 
table. 

Figure 2: Executing a fast read in D-Prophecy. Only one 
replica server executes the read (bold line); the others return 
the response sketch in the history table (dashed lines). 
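
As a rough illustration of the history table described above, the following minimal sketch stores (s(q), s(r)) pairs behind a get/set interface keyed by s(q). It assumes SHA-1 as the sketching function, as the text suggests; the class and method names (`HistoryTable`, `matches`) are hypothetical and not taken from the Prophecy implementation.

```python
import hashlib
from typing import Dict, Optional


def sketch(data: bytes) -> bytes:
    """s(.): a compact, collision-resistant digest (SHA-1, as in the text)."""
    return hashlib.sha1(data).digest()


class HistoryTable:
    """Maps s(q) -> s(r) through a get/set interface keyed by s(q)."""

    def __init__(self) -> None:
        self._entries: Dict[bytes, bytes] = {}

    def get(self, request: bytes) -> Optional[bytes]:
        """Return the stored response sketch for this request, or None."""
        return self._entries.get(sketch(request))

    def set(self, request: bytes, response: bytes) -> None:
        """Record (s(q), s(r)) for an observed request/response pair."""
        self._entries[sketch(request)] = sketch(response)

    def matches(self, request: bytes, response: bytes) -> bool:
        """True if the response agrees with the recorded sketch."""
        stored = self.get(request)
        return stored is not None and stored == sketch(response)


table = HistoryTable()
table.set(b"GET /index.html", b"<html>v1</html>")
print(table.matches(b"GET /index.html", b"<html>v1</html>"))  # True
print(table.matches(b"GET /index.html", b"<html>v2</html>"))  # False
```

Only fixed-size digests are kept, so the table size depends on the number of distinct requests rather than on the size of the responses.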

The specific use of the sketcher and its interaction 
with the replica group differs between Prophecy and D- 
Prophecy. However, both systems require the replica 
group to support the following request interface: 


• RESP r ← fast(REQ q) 
• (RESP r, SEQ_NO σ) ← replicated(REQ q) 


We expect the fast interface to be new for most replica 
groups. The replicated interface should already exist, but 
may need to be extended to return sequence numbers. No 
modifications are made to the replica group beyond what 
is necessary to support the interfaces, in either system. 


3.2 D-Prophecy 


Figure 2 shows the system model of D-Prophecy. Ex- 
cept for the sketcher, all other entities are standard com- 
ponents of a replicated service: clients send requests to 
(and receive responses from) a service implemented by 
N replica servers, according to some replication proto- 
col like PBFT. Each replica server is augmented with a 
sketcher that maintains a history table for read requests. 
The history table is read by the fast interface and updated 
by the replicated interface, as follows. 

A client issues a fast read q by sending it to all replica 
servers and choosing one of them to execute q and return 
r. The policy for selecting a replica server is unspecified, 
but a uniformly random policy has especially 
useful properties (see §4.2). The other replicas use their 
sketcher to look up the entry for s(q) and return the corresponding 
response sketch s(r), or null if the entry does 
not exist. If the client receives a quorum of non-null response 
sketches that match the sketch of the actual response, 
it accepts the response. The quorum size depends 
on the replication protocol; we give an example 
below. Otherwise, we say a transition has occurred and 
the client reissues the request as a replicated read. A 
replicated read is executed according to the protocol of 
the replica group, with one additional step: all replica 
servers use their sketcher to update the entry for s(q) 
with the new value of s(r), before sending a response 
to the client. 

Figure 3: Prophecy mediating access to a replica group. 
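
The client-side logic of a D-Prophecy fast read can be sketched as follows. This is a toy, single-process model under several assumptions that are not in the text: replicas are plain objects, the replicated read is a stand-in that takes the most common response rather than running a real agreement protocol, and the executing replica's own response is counted as one member of the quorum. All names are hypothetical.

```python
import hashlib
import random
from collections import Counter
from typing import Dict, List, Optional


def sketch(data: bytes) -> bytes:
    """s(.): SHA-1 digest, used as the compact sketch."""
    return hashlib.sha1(data).digest()


class Replica:
    """Toy replica: a key-value service plus a per-replica history table."""

    def __init__(self, state: Dict[bytes, bytes]) -> None:
        self.state = dict(state)
        self.history: Dict[bytes, bytes] = {}  # s(q) -> s(r)

    def execute(self, q: bytes) -> bytes:
        return self.state.get(q, b"")

    def lookup_sketch(self, q: bytes) -> Optional[bytes]:
        return self.history.get(sketch(q))  # None models "null"


def fast_read(replicas: List[Replica], q: bytes, executor: int,
              quorum: int) -> Optional[bytes]:
    """One replica executes q; the others only report their stored s(r).

    Assumption: the executing replica's response counts toward the quorum.
    Returns None when a transition is detected.
    """
    r = replicas[executor].execute(q)
    expected = sketch(r)
    votes = 1 + sum(
        1 for i, rep in enumerate(replicas)
        if i != executor and rep.lookup_sketch(q) == expected
    )
    return r if votes >= quorum else None


def replicated_read(replicas: List[Replica], q: bytes) -> bytes:
    """Stand-in for the group's agreement protocol (not a real BFT protocol):
    take the most common response, then update every history table."""
    r = Counter(rep.execute(q) for rep in replicas).most_common(1)[0][0]
    for rep in replicas:
        rep.history[sketch(q)] = sketch(r)
    return r


def read(replicas: List[Replica], q: bytes, quorum: int,
         rng: random.Random) -> bytes:
    """Try a fast read at a uniformly random replica; fall back on a transition."""
    r = fast_read(replicas, q, rng.randrange(len(replicas)), quorum)
    return r if r is not None else replicated_read(replicas, q)
```

With N = 4 replicas and a quorum of 3, the first read of a key falls back to the replicated path because every history table is empty; later reads succeed on the fast path. If the executing replica returns a response that the other replicas' sketches do not vouch for, `fast_read` returns `None` and the request is reissued as a replicated read.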

Readers familiar with the PBFT protocol will notice 
that fast reads in D-Prophecy look very similar to PBFT 
optimized reads. However, there is a crucial difference: 
PBFT requires every replica server to execute the read, 
while D-Prophecy requires only one such execution, per- 
forming in-memory lookups of s(r) at the rest. For non- 
null workloads, this represents a significant performance 
improvement, as shown in §7. On the flip side, each 
replica server requires additional memory to store its his- 
tory table, though in practice this overhead is small. The 
quorum size required for fast reads is identical to the quorum 
size required for optimized reads: (2N + 1)/3 responses 
suffice with some caveats (see §5.1.3 of [11]), 
and N always suffices. 

The architecture of D-Prophecy resembles that of a 
traditional BFT system: clients establish session keys 
with the replica servers and participate fully in the repli- 
cation protocol. As we observed in §2.2, this makes D- 
Prophecy unsuitable for Internet services, with their en- 
vironment of short-lived sessions and unmodified clients. 
This motivates the design of Prophecy, discussed next. 


3.3 Prophecy 


Figure 3 shows the simplest realization of Prophecy’s 
system model. (We consider extensions to the basic 
model in §5.) There are four types of entities: clients, 
sketchers, replica clients, and replica servers. Unmod- 
ified clients’ requests to a service are handled by the 
sketcher; together with the replica clients, this serves as 
the trusted proxy described in §2.2. The replica clients 
interact with the service, implemented by a group of N 
replica servers, according to some replication protocol. 


The sketcher issues each request through a replica 
client; the next subsection details the handling of re- 
quests. Functionally, the sketcher in Prophecy plays 
the same role as the per-replica-server sketchers in D- 
Prophecy. Architecturally, however, its role is quite dif- 
ferent. In Prophecy, a fast read is sent only to the single 
replica server that executes it, and neither the fast nor 
replicated interface accesses the history table directly. 
Thus, the replica group is treated as a black box. Since 
the sketcher is external to the replica group, writes pro- 
cessed by the group may no longer be visible or dis- 
cernible to the sketcher; i.e., there may exist an exter- 
nal write channel. Since only replica clients interact di- 
rectly with the replica servers, each replica client can 
maintain a single, long-lived session with each replica 
server. Wide-area clients are shielded from any churn 
in the replica group and are unaware of the replication 
protocol: the only responses they see are those that have 
already been accepted by the sketcher. 

The type of session used between clients and the 
sketcher is left open by our design, as it may vary from 
service to service. For example, services that only allow 
read or simple write operations (e.g., HTTP GETs and 
POSTs) may use unauthenticated sessions. A service like 
Facebook may use authentication only during user lo- 
gin, and use unauthenticated cookie-based sessions after 
that. Finally, services that store private or protected data, 
such as an online banking system, may secure sessions 
at the application level (e.g., using HTTPS). Prophecy’s 
architecture makes it easy to cope with the overhead of 
client-sketcher authentication, because one can simply 
add more sketchers if this overhead grows too high (see 
§5). To achieve the same scale-out effect, traditional 
BFT systems like PBFT and D-Prophecy would need to 
add entire replica groups. 


3.3.1 Handling a Request 


The sketcher stores two additional fields with each entry 
(s(q), s(r)) in the history table: the sequence number σ 
associated with r, and a 2-bit value b indicating whether 
s(q) is whitelisted (always issued as a fast read), black- 
listed (always issued as a replicated request), or neither 
(the default). The sketch s(r) is empty for whitelisted or 
blacklisted requests. Algorithm 1 describes the process- 
ing of a request and is illustrated in Figure 3 (numbers on 
the right correspond to the numbered steps in the figure). 

Prophecy requires a sequence number to be returned 
by replicated, as it seeks to issue concurrent requests to 
the replica group using multiple replica clients. Concurrency 
allows reads to execute in parallel to improve 
throughput. Unfortunately, a sketcher that issues requests 
concurrently has no way of discerning the correct 
order of replicated reads by itself, i.e., the order they 
were processed by the replica group. Thus, it relies on 
the sequence number returned by replicated to ensure 
that entries in the history table always reflect the latest 
system state. 

Algorithm 1 Processing a request at the sketcher. 

Receive request q from client                              (1) 
if q is a read request then 
    (s(q), s(r), σ, b) ← Lookup s(q) in history table 
    if (s(r) ≠ null) and (b ≠ blacklisted) then 
        r′ ← fast(q)                                       (2) 
        if (s(r′) = s(r)) or (b = whitelisted) then 
            return r′ to client                            (4) 
        end if 
    end if 
    (r′, σ′) ← replicated(q)                               (3) 
    if (s(r) = null) or (σ′ > σ) then 
        Update history table with (s(q), s(r′), σ′, b) 
    end if 
else 
    (r′, σ′) ← replicated(q)                               (3) 
end if 
return r′ to client                                        (4) 
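
Algorithm 1 can be transcribed into a small, single-threaded sketch as shown below. The `fast`, `replicated`, and `is_read` callables stand for the replica-group interface and the application-specific request classifier; they, together with the `Sketcher`, `Entry`, and `Flag` names, are hypothetical. One further assumption is made explicit in the comments: a missing table entry is modeled as a null sketch with sequence number -1.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Flag(Enum):
    DEFAULT = 0
    WHITELISTED = 1   # always issued as a fast read
    BLACKLISTED = 2   # always issued as a replicated request


@dataclass
class Entry:
    response_sketch: Optional[bytes]  # s(r); None models "null"
    seq_no: int                       # sequence number associated with r
    flag: Flag = Flag.DEFAULT         # the 2-bit value b


def sketch(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


class Sketcher:
    def __init__(self,
                 fast: Callable[[bytes], bytes],
                 replicated: Callable[[bytes], Tuple[bytes, int]],
                 is_read: Callable[[bytes], bool]) -> None:
        self.fast = fast
        self.replicated = replicated
        self.is_read = is_read
        self.history: Dict[bytes, Entry] = {}

    def handle(self, q: bytes) -> bytes:
        if not self.is_read(q):
            r, _ = self.replicated(q)  # writes always take the replicated path
            return r
        key = sketch(q)
        # Assumption: an absent entry behaves like (null sketch, seq_no -1).
        entry = self.history.get(key, Entry(None, -1))
        if (entry.response_sketch is not None
                and entry.flag is not Flag.BLACKLISTED):
            r = self.fast(q)
            if (sketch(r) == entry.response_sketch
                    or entry.flag is Flag.WHITELISTED):
                return r               # fast read accepted
        # Transition detected, unknown request, or blacklisted request.
        r, seq_no = self.replicated(q)
        if entry.response_sketch is None or seq_no > entry.seq_no:
            self.history[key] = Entry(sketch(r), seq_no, entry.flag)
        return r
```

The sequence-number comparison only matters when replicated reads for the same request complete out of order, which cannot happen in this single-threaded sketch; a concurrent version would additionally need to guard the history table against simultaneous updates.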


The sketcher requires some application-specific knowledge 
of the format of q and r. This information is used 
to determine if q is a read or write request, and to discard 
extraneous or non-deterministic information from q 
or r while computing s(q) or s(r). For example, in our 
prototype implementation of Prophecy, an HTTP request 
is parsed by an HTTP protocol handler to extract the 
URL and HTTP method of the request; the same handler 
removes the date/time information from HTTP headers 
of the response. In practice, the required application- 
specific knowledge is minimal and limited to parsing 
protocol headers; the payload of the request or response 
(e.g., the HTTP body) is treated opaquely by the sketcher. 
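
The kind of application-specific handling described above might look like the following sketch for HTTP. It is not the prototype's protocol handler: headers are assumed to arrive as an already-parsed dictionary, only the Date header is treated as non-deterministic (following the example in the text), and classifying GET and HEAD as reads is an assumption made here. All function names are hypothetical.

```python
import hashlib
from typing import Dict

# Assumption: only the Date header is stripped, following the text's example.
NON_DETERMINISTIC_HEADERS = {"date"}


def is_read(method: str) -> bool:
    """Assumption: GET and HEAD are reads; everything else is a write."""
    return method.upper() in {"GET", "HEAD"}


def request_sketch(method: str, url: str) -> bytes:
    """s(q): computed over the HTTP method and URL only."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).digest()


def response_sketch(headers: Dict[str, str], body: bytes) -> bytes:
    """s(r): computed over the remaining headers and the opaque body."""
    digest = hashlib.sha1()
    for name in sorted(headers, key=str.lower):
        if name.lower() in NON_DETERMINISTIC_HEADERS:
            continue  # drop date/time information before sketching
        line = f"{name.lower()}:{headers[name].strip()}\n"
        digest.update(line.encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)  # the payload is treated opaquely
    return digest.digest()
```

Two responses that differ only in their Date header then map to the same sketch, so a fast read is not rejected merely because the server's clock advanced.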

Whitelisting and blacklisting add flexibility to the han- 
dling of requests, but may require additional application- 
specific knowledge. One use of blacklisting that does 
not require such knowledge is to dynamically blacklist 
requests that exhibit a high frequency of transitions (e.g., 
dynamic content). This allows the sketcher to avoid is- 
suing fast reads that are very likely to fail. (We do not 
currently implement this optimization.) 


3.4 Performance 


In our analysis and evaluation, the sketcher is able to ac- 
commodate all read requests in its history table without 
evicting any entries. If needed, a replacement policy such 
as LRU may be used, but this is unlikely: our current im- 
plementation can store up to 22 million unique entries 
using less than 1GB of memory. 

The performance savings of a sketcher come from the 
ability to execute fast, load-balanced reads whose re- 
sponses match the entries of the history table. Thus, 
Prophecy and D-Prophecy are most effective in read- 
mostly workloads. We can estimate the savings by look- 
ing at the cost, in terms of per-replica processing time, 
of executing a read in these systems. Let $t$ be the probability 
that a state transition occurs in a given workload. 
Let $C_R$ be the cost of a replicated read and $C_r$ the cost 
of a fast read (excluding any sketcher processing in the 
case of D-Prophecy), and let $C_{hist}$ be the cost of computing 
a sketch and performing a lookup/update in a history 
table. Below, we calculate the expected cost of a 
read in Prophecy and D-Prophecy when used with a BFT 
replica group that uses PBFT’s read optimization. For 
comparison, we include the cost of the unmodified BFT 
group; here, $t'$ is the probability that a PBFT optimized 
read fails. 


Prophecy:   $[C_r + 2C_{hist}] + [t(NC_R + C_{hist})]$ 
D-Prophecy: $[C_r + (N-1)C_{hist}] + [t(NC_R + NC_{hist})]$ 
BFT:        $[NC_r] + [t'NC_R]$ 


The addends on the left and right of each equation 
show the cost of a fast read and a replicated read, respec- 
tively. The equations do not include optimizations that 
benefit all systems equally, such as separating agreement 
from execution [67]. Prophecy performs two lookups in 
the history table during a fast read (one before and one 
after executing the read), and one update to the history 
table during a replicated read. D-Prophecy performs a 
history table lookup at all but one replica server during 
a fast read, and an update to the history table of each 
replica server during a replicated read. These equations 
show that Prophecy operates at maximum throughput 
when there are no transitions, because only one replica 
server processes each request, as compared to over 2/3 
of the replica servers in the BFT system (assuming, ideal- 
istically, that only a necessary quorum of replica servers 
execute the optimized read, and the remaining replicas 
ignore it). Since $C_{hist} \ll C_r$ for non-null workloads 
(the former involves an in-memory table lookup, the latter 
an actual read), this is a factor of over (2/3)N improvement. 
D-Prophecy’s savings are similar for the same 
reason. Although $t'$ may be significantly less than $t$ in 
practice, given that PBFT optimized reads may still succeed 
even when a state transition occurs, our evaluation 
in §7 reveals that the benefit of PBFT optimized reads 
over replicated reads is small for real workloads. Finally, 
while Prophecy’s throughput advantage degrades as $t$ increases, 
we demonstrate in §8 that $t$ is indeed low for 
popular web services. 
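
The three cost expressions of this section are easy to evaluate numerically. The sketch below encodes them directly; the function and parameter names are hypothetical, and the numeric values in the example are made-up inputs chosen only to illustrate the shape of the comparison, not measurements from the evaluation.

```python
def prophecy_cost(n: int, fast: float, replicated: float,
                  hist: float, t: float) -> float:
    """[C_r + 2*C_hist] + [t * (N*C_R + C_hist)]"""
    return (fast + 2 * hist) + t * (n * replicated + hist)


def d_prophecy_cost(n: int, fast: float, replicated: float,
                    hist: float, t: float) -> float:
    """[C_r + (N-1)*C_hist] + [t * (N*C_R + N*C_hist)]"""
    return (fast + (n - 1) * hist) + t * (n * replicated + n * hist)


def bft_cost(n: int, fast: float, replicated: float, t_prime: float) -> float:
    """[N*C_r] + [t' * N*C_R]"""
    return n * fast + t_prime * n * replicated


# Hypothetical inputs: N = 4 replicas, costs in arbitrary time units.
N, C_FAST, C_REPLICATED, C_HIST = 4, 1.0, 1.2, 0.01

for t in (0.0, 0.1, 0.5):
    p = prophecy_cost(N, C_FAST, C_REPLICATED, C_HIST, t)
    d = d_prophecy_cost(N, C_FAST, C_REPLICATED, C_HIST, t)
    b = bft_cost(N, C_FAST, C_REPLICATED, t)  # uses t' = t as a pessimistic bound
    print(f"t={t:.1f}  Prophecy={p:.3f}  D-Prophecy={d:.3f}  BFT={b:.3f}")
```

With no transitions (t = 0) the Prophecy and D-Prophecy costs are close to that of a single fast read, while the BFT cost as written in the equation grows with N. As t increases, the second addend dominates and the advantage shrinks.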




4 Consistency Properties 


Despite their relatively simple designs, the consistency 
properties of Prophecy and D-Prophecy are only slightly 
weaker than those of the (unmodified) replica group. In 
this section, we formalize the notion of delay-once con- 
sistency introduced in §2. Delay-once consistency is a 
derived consistency model; here, we derive it from lin- 
earizability [32], the consistency model of most BFT 
protocols, and obtain delay-once linearizability. Then, 
we show how Prophecy implements delay-once lineariz- 
ability. 


4.1 Delay-once Linearizability 


A history of requests and responses executed by a service 
is linearizable if it is equivalent to a sequential history [40] 
that respects the irreflexive partial order on requests 
imposed by their real-time execution [32]. Request 
$X$ precedes request $Y$ in this order, written $X \prec Y$, 
if the response of $X$ is received before $Y$ is sent. Suppose 
one client sends requests $\langle R^a, W^b, R^c \rangle$ to the service and 
another client sends requests $\langle W^d, R^e, R^f, W^g \rangle$, with partial 
order $\{R^a \prec R^e, W^g \prec R^c\}$. Then a valid linearized 
history could look like the following: 


$\langle R^a_0, W^d_1, W^b_2, R^e_2, R^f_2, W^g_3, R^c_3 \rangle$. 


The R’s and W’s represent read and write requests, and 
subscripts represent the service state reflected in the response 
to each request (following [28]). In contrast to 
this history, the following is a valid delay-once linearizable 
history, though it is not linearizable: 


$\langle R^a_0, W^d_1, W^b_2, R^e_0, R^f_2, W^g_3, R^c_2 \rangle$. 


Requests $R^e$ and $R^c$ have stale responses because they 
do not reflect the state update caused by sequentially 
precedent writes (note that the staleness of $R^e$’s response 
is discernible to the issuing client, whereas the staleness 
of $R^c$’s response is not). At a high level, a delay-once history 
looks like a linearized history with reads that reflect 
the state of prior reads, but not necessarily prior writes. 
The manner in which reads can be stale is not arbitrary, 
however. Specifically, a history $H$ is delay-once linearizable 
if the subsequence of write requests in $H$, denoted 
by $H|_w$, satisfies linearizability, and if read requests satisfy 
the following property: 


Delay-once property. For each read request $R_x$ in 
$H$, let $R_y$ and $W_z$ be the read and write request of 
maximal order in $H$ such that $R_y \prec R_x$ and $W_z \prec R_x$. 
Then either $x = y$ or $x = z$. 


Delay-once linearizability implies both monotonic 
read and monotonic write consistency, but not read-after-write 
consistency. If $\prec_H$ is the partial order of the history 
$H$, delay-once linearizability respects $\prec_{H|_w}$, but not $\prec_H$, 
due to the possible presence of stale reads. 

The delay-once property ensures two things: first, 
reads never reflect state older than that of the latest read 
(they are only delayed to one stale state), and second, 
reads that are updated reflect the latest state immediately. 
Thus, a system that implements delay-once consistency 
is responsive. To verify if a read in a delay-once consis- 
tent history H is stale, one can check the following: 


Staleness indicator. Given a read request $R_x$ in $H$,
let $W_y$ be the write request of maximal order in $H$
such that $W_y \prec R_x$. $R_x$ is stale if and only if $x < y$.


The staleness property explains why object-based sys- 
tems like web services fare particularly well with delay- 
once consistency. In these systems, state updates to one 
object are isolated from other objects, so staleness can 
only occur between writes and reads that affect the same 
object. 
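
The delay-once property and the staleness indicator can be checked mechanically. The following Python sketch is illustrative only. It assumes that a history is given as a list of (kind, state) pairs in history order, that the writes are already linearized and numbered 1, 2, ... by the state they produce, and that the initial state is 0. The function names and the two sample histories are hypothetical.

```python
from typing import List, Tuple

History = List[Tuple[str, int]]  # ("R" or "W", state reflected in the response)

def is_delay_once(history: History, initial_state: int = 0) -> bool:
    """Check that each read reflects the latest prior read or the latest prior write."""
    last_read = None            # state of the read of maximal order seen so far
    last_write = initial_state  # state of the write of maximal order seen so far
    for kind, state in history:
        if kind == "W":
            last_write = state
            continue
        allowed = {last_write}
        if last_read is not None:
            allowed.add(last_read)
        if state not in allowed:
            return False
        last_read = state
    return True

def stale_reads(history: History, initial_state: int = 0) -> List[int]:
    """Return the positions of reads whose state is older than the latest prior write."""
    stale, last_write = [], initial_state
    for pos, (kind, state) in enumerate(history):
        if kind == "W":
            last_write = state
        elif state < last_write:
            stale.append(pos)
    return stale

# Two sample histories over the same seven requests (assumed for illustration).
linearized = [("R", 0), ("W", 1), ("W", 2), ("R", 2), ("R", 2), ("W", 3), ("R", 3)]
delayed = [("R", 0), ("W", 1), ("W", 2), ("R", 0), ("R", 2), ("W", 3), ("R", 2)]
```

Under these assumptions, both sample histories satisfy the delay-once property, but only the second contains stale reads. A read that returns a state that is neither the latest read nor the latest write, such as state 1 after writes 1 and 2, is rejected.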

The above derivation of delay-once consistency is 
based on linearizability, but derivations from other con- 
sistency models are possible. For example, a weaker 
condition called read-after-write consistency also yields 
meaningful delay-once semantics. 


4.2 Prophecy’s Consistency Semantics 


We now show that Prophecy implements delay-once lin- 
earizability when used with a replica group that guaran- 
tees linearizability, such as a PBFT replica group. A 
similar (but simpler) argument shows that D-Prophecy 
achieves delay-once linearizability, omitted here due to 
space constraints. 

Prophecy inherits the system and network model of the 
replica group. When used with a PBFT replica group, we 
assume an asynchronous network between the sketcher 
and the replica group that may fail to deliver messages, 
may delay them, duplicate them, or deliver them out-of- 
order. Replica clients issue requests to the replica group 
one at a time; requests are retransmitted until they are 
received. We do not make any assumptions about the or- 
ganization of the service’s state; for example, the service 
may be a monolithic replicated state machine [39, 58] 
or a collection of numerous, isolated objects [32]. The 
sketcher may process requests concurrently. We model 
this concurrency by allowing the sketcher to issue re- 
quests to multiple replica clients simultaneously; the or- 
der in which these requests return from replica clients is 
arbitrary. Updates to service state may not be discernible 
or visible to the sketcher (i.e., there may exist an external
write channel), as discussed in §3.3. We show
that Prophecy achieves delay-once linearizability despite 
concurrent requests and external writers. 

Our analysis of Prophecy’s consistency requires a non- 
standard approach because it is the sketcher, not the 
replica servers, that enforces this consistency, and because
fast reads are executed by individual replicas. In
particular, we introduce the notion of an accepted history.
Let $H_i$ for $1 \le i \le N$ be the history of all write
requests executed by replica server $i$ and all fast read
requests executed by $i$ that were accepted by the sketcher.
Let $R$ be the history of all replicated read requests
accepted by the sketcher. An accepted history $A_i$ is the
union of $H_i$ and $R$, for each replica server $i$. The
position in $A_i$ of each replicated read in $R$ is well defined
because all reads are accepted at a single location (the
sketcher) and all replicated requests are totally ordered
by linearizability. We claim that the accepted history $A_i$
is delay-once linearizable.


To see this, observe that replicated requests satisfy 
linearizability because they follow the protocol of the 
replica group. The sketcher ensures that replicated reads 
update the history table according to this order by using 
the sequence numbers returned by the replicated interface.
Further, the sketcher only accepts a fast read if it
reflects the state of the latest replicated read. Since $A_i$
contains all replicated reads accepted by the sketcher (not
just those accepted by $i$), and since accepted fast reads
never reflect new state, it follows that all fast reads in $A_i$
must satisfy the delay-once property. While $A_i$ may not
contain all write requests accepted by the replica group
(e.g., if $i$ is missing an update), this only affects $i$’s ability
to participate in replicated reads, and does not violate
delay-once linearizability. Thus, we conclude that $A_i$ is
delay-once linearizable.
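
The acceptance rule used in this argument can be illustrated with a small Python sketch. It is not the authors’ implementation: the replica callables, the `replicated_read` callable that returns a (sequence number, response) pair, and the in-memory dictionary standing in for the history table are all assumptions made for the example.

```python
import hashlib
import random

def sketch(data: bytes) -> bytes:
    """Collision-resistant digest used as a compact stand-in for a request or response."""
    return hashlib.sha1(data).digest()

class Sketcher:
    def __init__(self, replicas, replicated_read):
        self.replicas = replicas                # hypothetical: request -> response bytes
        self.replicated_read = replicated_read  # hypothetical: request -> (seq, response bytes)
        self.table = {}                         # request sketch -> (seq, response sketch)

    def read(self, request: bytes, rng=random):
        key = sketch(request)
        entry = self.table.get(key)
        if entry is not None:
            # Fast read: ask one randomly chosen replica and accept the answer
            # only if it matches the response recorded by the latest replicated read.
            response = rng.choice(self.replicas)(request)
            if sketch(response) == entry[1]:
                return response, "fast"
        # Otherwise fall back to a replicated read and record it, using the
        # sequence number so that an older result never overwrites a newer one.
        seq, response = self.replicated_read(request)
        if entry is None or seq > entry[0]:
            self.table[key] = (seq, sketch(response))
        return response, "replicated"
```

In this sketch a fast read can return stale data only when the history table itself is stale, which is the situation the delay-once property permits.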


Limiting staleness via load balancing. Stale responses 
are returned by faulty replica servers or correct replica 
servers that are out-of-date. We can easily verify if an 
accepted history contains stale responses by checking the 
staleness indicator defined in §4. 


To limit the number of stale responses, the fast interface
dispatches fast reads from all clients uniformly
at random over the replica servers.² Let $g$ be the fraction
of faulty or out-of-date replica servers currently in
the replica group. If $g$ is a constant, then $g^k$, the probability
that $k$ consecutive fast reads are sent to these
servers, is exponentially decreasing. For BFT protocols,
$g < 2/3$ assuming a worst-case scenario where the maximum
number of correct nodes are out-of-date. For a
replica group of size 4, the probability that $k \ge 6$ is less
than 1.6%.
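
The arithmetic behind this bound can be reproduced with a short Python sketch. It assumes a group of n = 3f + 1 replicas in which, in the worst case, f replicas are faulty and a further f correct replicas are out-of-date; this reading of the worst case is an assumption, as are the function names.

```python
def worst_case_fraction(n: int) -> float:
    """Worst-case fraction g of faulty or out-of-date replicas, assuming n = 3f + 1."""
    f = (n - 1) // 3
    return 2 * f / n

def stale_run_probability(g: float, k: int) -> float:
    """Probability that k consecutive uniformly random fast reads all hit such replicas."""
    return g ** k
```

For a group of size 4 this gives g = 0.5, so six consecutive fast reads all reach faulty or out-of-date replicas with probability 0.5^6 = 0.015625, which is consistent with the bound of 1.6% quoted above.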


²We assume for simplicity that the random selection is secure,
though in practice faulty replica servers may hamper this process. The 
latter is an interesting problem, but outside the scope of this paper. 


5 Scale and Complex Architectures


This section describes extensions to the basic Prophecy 
model in order to integrate fault tolerance into larger- 
scale and more complex environments. 


Scaling through multiple sketchers. In the basic system
model of Prophecy (Figure 3), the sketcher is a single
bottleneck and point of failure. We address this limitation
by using multiple sketchers to build a sketching core,
as follows. First, we horizontally partition the global
history table, based on s(q)’s, into non-overlapping regions,
e.g., using consistent hashing [33]. We assign
each region to a distinct sketcher; we refer to these as
response sketchers. The partitioning preserves delay-once
semantics because only a single sketcher stores the entry
for each s(q). Second, we build a two-level sketching
system as shown in Figure 4, where the first tier
of request sketchers demultiplexes client requests. That
is, given a request q, any of a small number of request
sketchers computes s(q) and forwards q to the appropriate
response sketcher. Using a one-hop distributed hash
table (DHT) [27, 33] to manage the partitioning works
well, given the network’s small, highly-connected nature.
The response sketchers (the members of this DHT)
issue requests to the replica group(s) and sketch the responses,
ultimately returning them to the clients. (Importantly,
the replica servers in Figure 4 need not be part
of a single replica group, but may instead be organized
into multiple, smaller groups.) The larger number of response
sketchers reflects the asymmetric bandwidth requirements
of network protocols like HTTP. We evaluate
the scaling benefits of multiple response sketchers in
§7.7.
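
One way a request sketcher could map s(q) to a response sketcher is a consistent-hash ring. The Python sketch below is a minimal illustration under stated assumptions: the sketcher names, the number of virtual nodes, and the use of the first 8 bytes of a SHA-1 digest as the ring position are all hypothetical choices, not details taken from Prophecy.

```python
import bisect
import hashlib

def _point(label: str) -> int:
    """Position of a virtual node on the ring (first 8 bytes of a SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")

class SketcherRing:
    def __init__(self, sketchers, vnodes: int = 64):
        # Each response sketcher owns several points so that regions are spread evenly.
        self._ring = sorted(
            (_point(f"{name}#{v}"), name) for name in sketchers for v in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def owner(self, request_sketch: bytes) -> str:
        """Return the single response sketcher responsible for this request sketch."""
        key = int.from_bytes(request_sketch[:8], "big")
        idx = bisect.bisect_right(self._points, key) % len(self._ring)
        return self._ring[idx][1]
```

Because every request sketch has exactly one owner, only one response sketcher stores its history-table entry. When a sketcher leaves, only the regions it owned move to other sketchers; all other request sketches keep their owner.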


Handling sketcher failures. The sketching core han- 
dles failure and recovery of sketchers seamlessly, be- 
cause it can rely on the join and leave protocol of the 
underlying DHT. Since request sketchers direct client re- 
quests, they maintain the partitioning of the DHT. To 
preserve delay-once semantics, this partitioning must be 
kept consistent [10, 66] to avoid sending requests from 
the same region of the history table to multiple response 
sketchers. Prophecy’s support for blacklisting simplifies 
this task, however. In particular, whenever a region of 
the history table is being relinquished or acquired be- 
tween response sketchers, we can allow more than one 
response sketcher to serve requests from the same region 
provided the entire region is blacklisted (forcing all re- 
quests to be replicated). Once the partitioning has stabi- 
lized, the new owner of the region can unset the blacklist 
bit. As a result, membership dynamics can be handled 
smoothly and simply, at the cost of transient inefficiency 
but not inconsistency. 
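
The handoff rule described above can be sketched as a small piece of bookkeeping. The class and method names below are hypothetical, and the sketch assumes that each region of the history table carries a single blacklist bit; it only illustrates the ordering of steps, not the DHT protocol itself.

```python
class RegionDirectory:
    """Tracks which response sketcher owns each history-table region."""

    def __init__(self, owners):
        self.owner = dict(owners)  # region id -> response sketcher name
        self.blacklisted = set()   # regions whose requests must all be replicated

    def begin_handoff(self, region):
        # While ownership is in flux, several sketchers may serve the region,
        # so every request for it is forced onto the replicated path.
        self.blacklisted.add(region)

    def finish_handoff(self, region, new_owner):
        # Once the partitioning has stabilized, the new owner clears the bit.
        self.owner[region] = new_owner
        self.blacklisted.discard(region)

    def may_use_fast_read(self, region) -> bool:
        return region not in self.blacklisted
```

During the handoff window the region loses the benefit of fast reads, which corresponds to the transient inefficiency mentioned above.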

Figure 4: Scaling out Prophecy using multiple sketchers.


Mediating loosely-coupled groups. A sketching core 
can be shared by the multiple, loosely-coupled com- 
ponents that typically comprise a real service. Alter- 
natively, components that operate in parallel can use 
Prophecy via dedicated sketchers. Components that op- 
erate in series, such as multi-tier web services, can use 
Prophecy prior to each tier. However, applying agree- 
ment protocols in series introduces nontrivial consis- 
tency issues. We leave treatment of this problem to future 
work. 


6 Implementation 


Our implementation of Prophecy and D-Prophecy is 
based on PBFT [12]. We used the PBFT codebase given 
its stable and complete implementation, as well as newer 
results [6] showing its competitiveness with Zyzzyva and 
other recent protocols (much more so than was origi- 
nally indicated [38]). We implemented and compared 
three proxied systems (Prophecy, proxied PBFT with- 
out optimized reads, and proxied PBFT with optimized 
reads), as well as three non-proxied (“direct”) systems 
(D-Prophecy, PBFT without optimized reads, and PBFT 
with optimized reads). In our evaluation, we will com- 
pare proxied systems only with other proxied systems, 
and similarly for direct systems, as the architectures and 
assumptions of the two models are fundamentally differ- 
ent. The proxied systems do not authenticate communi- 
cation between clients and the sketcher, though they eas- 
ily can be modified to do so with equivalent overheads. 
We implemented a user-space Prophecy sketcher in 
about 2,000 lines of C++ code using the Tamer asyn- 
chronous I/O library [36]. The sketcher forks a pro- 
cess for each core in the machine (8 in our test clus- 
ter), and the processes share a single history table via 
shared memory. The sketcher interacts with PBFT 
replica clients through the PBFT library. The pool of 
replica clients available to handle requests is managed as 
a queue. The sketching function uses a SHA-1 hash [48]
over parts of the HTTP header (for requests) and the entire
response body (for responses). The proxied PBFT
variants share the same code base as the sketcher, but do
not perform sketching, issue fast reads, or create or use
the history table.
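
A sketching function of this kind can be approximated in a few lines of Python. The text does not say which parts of the HTTP header are hashed, so the choice of method, request target, and Host header below is an assumption, as are the function names.

```python
import hashlib

def request_sketch(method: str, target: str, host: str) -> bytes:
    """SHA-1 over selected request fields (assumed: method, target, Host)."""
    h = hashlib.sha1()
    for field in (method.upper(), target, host.lower()):
        h.update(field.encode("utf-8"))
        h.update(b"\x00")  # separator so that field boundaries stay unambiguous
    return h.digest()

def response_sketch(body: bytes) -> bytes:
    """SHA-1 over the entire response body."""
    return hashlib.sha1(body).digest()
```

Each sketch is 20 bytes regardless of the response size, which keeps the history table compact. Fields that vary between otherwise identical requests, such as User-Agent, are deliberately left out of the request sketch in this example.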

We modified the PBFT library in three ways: to add
support for fast reads (about 20 lines of code), to return
the sequence numbers (about 20 LOC), and to add support
for D-Prophecy (about 100 LOC). Additional modifications
enabled the same process to use multiple PBFT
clients concurrently (500 LOC), and extended the simple
server distributed with PBFT to simulate a webserver
and allow “null” writes (500 LOC), as null operations
actually have 8-byte payloads in PBFT. We also wrote a
PBFT client in about 1000 lines of C++/Tamer that can
be used as a client in direct systems and as a replica client
in proxied systems.


7 Evaluation 


This section quantifies the performance benefits and 
costs of Prophecy and D-Prophecy, by characterizing 
their latency and throughput relative to PBFT under vari- 
ous workloads. We explore how the system’s throughput 
characteristics change when we modify a few key vari- 
ables: the processing time of the request, the size of the 
response, and the client’s session length. Finally, we examine
how Prophecy scales out in terms of replica group
size.


7.1 Experimental Setup 


All of our experiments were run in a 25-machine clus- 
ter. Each machine has eight 2.3GHz cores and 8GB of 
memory, and all are connected to a 1Gbps switch. 

The proxied systems are labeled Prophecy, pr-PBFT 
(proxied PBFT), and pr-PBFT-ro (proxied PBFT with the 
read optimization). The direct systems are labeled D- 
Prophecy, PBFT, and PBFT-ro (PBFT with the read op- 
timization). Multicast and batching are not used in our 
experiments, as they do not impact performance when 
using read optimizations; all other PBFT optimizations 
are employed. Unless otherwise specified, all experi- 
ments used four replica servers, a single sketcher/proxy 
machine for the proxied systems, and a single client ma- 
chine. The proxied experiments used 40 replica clients 
across eight processes at the sketcher/proxy, and had 100 
clients establish persistent HTTP connections with the 
sketcher/proxy. The direct experiments used 40 clients 
across eight processes. These numbers were sufficient 
to fully saturate each system without degrading perfor- 
mance. All experiments use infinite-length sessions be- 
tween communicating entities (except for the one eval- 
uating the effect of session length). Throughput exper- 
iments were run for 30-second intervals and throughput 
was averaged over each second. 

Table 2: Latency in µs for serial null reads.

Figure 5: Throughput of null reads for proxied systems
(Prophecy, pr-PBFT, and pr-PBFT-ro).


In some experiments, we report numbers for 
Prophecy-X or D-Prophecy-X, which signifies that the 
systems experienced state transitions X% of the time. 


7.2 Null Workload 


Latency. Table 2 shows the median and 99th percentile 
latencies for 100,000 serial null requests sent by a single 
client. All systems displayed low latencies under 1ms, although
the proxied systems have higher latencies as each
request must traverse an extra hop. Prophecy, pr-PBFT- 
ro, D-Prophecy, and PBFT-ro all avoid the agreement 
phase during request processing and thus have notably 
lower latency than their counterparts. Prophecy-100 and 
D-Prophecy-100 represent a worst-case scenario where 
every fast read fails and is reissued as a replicated read. 


Throughput. Figure 5 shows the aggregate throughput 
of the proxied systems for executing null requests. We 
achieve the desired transition ratio by failing that fraction 
of fast reads at the sketcher. 

Since replica servers can execute null requests 
cheaply, the sketcher/proxy becomes the system bot- 
tleneck in these experiments. Nevertheless, Prophecy 
achieves 69% higher throughput than pr-PBFT-ro due to 
its load-balanced fast reads, which require fewer pack- 
ets to be processed by replica servers. As the transi- 
tion ratio increases, however, Prophecy’s advantage de- 
creases because fewer fast reads match the history table. 
For example, when transitions occur 15% of the time (a
representative ratio from our measurement study in §8),
Prophecy’s throughput is 7% lower than that of pr-PBFT-ro.

Figure 6: Throughput of null reads for direct systems (D-Prophecy,
PBFT, and PBFT-ro).

Figure 6 depicts the aggregate throughput of the direct 
systems. In this experiment, 40 clients across two ma- 
chines concurrently execute null requests. D-Prophecy’s 
throughput is 15% lower than PBFT-ro’s when there are 
no transitions, and 50% lower when there are 15% tran- 
sitions. D-Prophecy derives no performance advantage 
from its fast reads because the optimized reads of PBFT 
take no processing time, while D-Prophecy incurs ad- 
ditional overhead for sketching and history table oper- 
ations. 


7.3 Server Processing Time


The previous subsection shows that when requests take 
almost no time to process, Prophecy improves through- 
put only by decreasing the number of packets at each 
replica server, while D-Prophecy fails to achieve bet- 
ter throughput. However, when the replicas perform 
real work, such as the computation or disk I/O associ- 
ated with serving a webpage, Prophecy’s improvement is 
more dramatic. 

Figures 7 and 8 demonstrate how varying processing 
time affects the throughput of proxied systems (normalized
against pr-PBFT-ro) and direct systems (normalized
against PBFT-ro), respectively. As the processing
time increases—implemented using a busy-wait loop— 
the cost of executing requests begins to dominate the cost 
of agreeing on their order. This decreases the effective- 
ness of PBFT’s read optimization, as evidenced by the 
increase in pr-PBFT’s throughput relative to pr-PBFT- 
ro, and similarly between PBFT and PBFT-ro. At the 
same time, the higher execution costs dramatically in- 
crease the effectiveness of load balancing in Prophecy 
and D-Prophecy. Their throughput approaches 3.9 times 
the baseline, which is only 2.5% less than the theoretical 
maximum. 
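
A back-of-the-envelope model helps explain where a theoretical maximum of four times the baseline comes from. The model below is our own simplification and is not taken from the measurements: it assumes that request execution is the only cost, that the baseline executes every read on all N replicas, that a successful fast read executes on one replica, and that a failed fast read is followed by a replicated read on all N replicas.

```python
def ideal_speedup(n_replicas: int, transition_ratio: float) -> float:
    """Idealized throughput relative to a baseline that executes each read on every replica."""
    baseline_cost = n_replicas
    # Every request pays for one fast read; a fraction t also pays for a replicated read.
    prophecy_cost = 1 + transition_ratio * n_replicas
    return baseline_cost / prophecy_cost
```

With four replicas and no transitions the model gives a speedup of 4, so a measured factor of 3.9 is 2.5% below that ceiling. The model ignores agreement, networking, and sketching costs, so it should be read as an upper bound rather than a prediction.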

The effectiveness of load-balancing is more pronounced
in Prophecy than in D-Prophecy for two main
reasons. First, Prophecy’s fast reads involve only one
replica server, while D-Prophecy’s fast reads involve all
replicas, even though only a single replica actually executes
the request. Second, Prophecy performs sketching
and history table operations at the sketcher, whereas D-Prophecy
implements such functionality on the replica
servers, stealing cycles from normal processing.

Figure 7: Throughput of proxied systems as processing
time increases, normalized against pr-PBFT-ro.

Figure 8: Throughput of direct systems as processing time
increases, normalized against PBFT-ro.


7.4 Integration with Apache Webserver 


We applied Prophecy to a replica group in which each 
server runs the Apache webserver [7], appropriately 
modified to return deterministic results. Upon receiving 
a request, a PBFT server dispatches the request body to 
Apache via a persistent TCP connection over localhost. 


Figure 9 shows the aggregate throughput of the proxied
systems for serving a 1-byte webpage. When there
are no transitions, Prophecy’s throughput is 372% that
of pr-PBFT-ro. At the representative ratio of 15%,
Prophecy’s throughput is 205% that of pr-PBFT-ro. The
processing time of Apache is enough to dominate all
other factors, causing Prophecy’s use of fast reads to significantly
boost its throughput.


Figure 10 shows the throughput of direct systems. 
With no transitions, D-Prophecy’s throughput is 265% 
that of PBFT-ro, and 141% when there are 15% transi- 
tions. 


In these experiments, the local HTTP requests to
Apache took an average of 94µs. For the remainder of
this section, we use a simulated processing time of 94µs
within replica servers when answering requests.

Figure 9: Throughput of reads of a 1-byte webpage to
Apache webservers for proxied systems.

Figure 10: Throughput of concurrent reads of a 1-byte
webpage to Apache webservers for direct systems.


7.5 Response Size 


Next, we evaluate the proxied systems’ performance 
when serving webpages of increasing size, as shown by 
Figure 11. As the response size increases, fewer replica 
clients were needed to maximize throughput. At the 
same time, Prophecy’s throughput advantage decreases 
as the response size increases, as the sketcher/proxy 
becomes the bottleneck in each scenario. Increasing 
the replica servers’ processing time shifts this drop in 
Prophecy’s throughput to the right, as it increases the 
range of response sizes for which processing time is the
dominating cost. Note that we only evaluate the systems 
up to 64KB responses, because PBFT communicates via 
UDP, which has a maximum packet size of 64KB. 


7.6 Session Length 


Our experiments with direct systems so far did not ac- 
count for the cost of establishing authenticated sessions 
between clients and replica servers. To establish a new 
session, the client must generate a symmetric key that 
it encrypts with each replica server’s public key, and 
each replica server must perform a public-key decryp- 
tion. Given the cost of such operations, the performance 
of short-lived sessions can be dominated by the overhead 
of session establishment, as we discussed in §2.2. 

Figure 12 demonstrates the effect of varying session
length on the direct systems, in which each request per
session returns a 1-byte webpage. We find that the
throughputs of PBFT and PBFT-ro are indistinguishable
for short sessions, but as session length increases, the
cost of session establishment is amortized over a larger
number of requests, and PBFT-ro gains a slight throughput
advantage. Similarly, D-Prophecy achieves its full
throughput advantage only when sessions are very long.

Figure 11: Throughput of proxied systems as response size
increases, normalized against pr-PBFT-ro.

Figure 12: Throughput of direct systems as session length
increases, normalized against PBFT-ro.

We do not evaluate the effect of session lengths in the 
proxied systems, because they currently do not authenti- 
cate communication with the clients. Authentication can 
easily be incorporated into these systems, however, at a 
similar cost to Prophecy and pr-PBFT. That said, prox- 
ied systems can better scale up the maximum rate of ses- 
sion establishment than direct systems, as we observed 
in §3.3: each additional proxy provides a linear rate in- 
crease, while direct systems require an entire new replica 
group for a similar linear increase. 


7.7 Scaling Out 


Finally, we characterize the scaling behavior of Prophecy 
and proxied PBFT systems. By increasing the size of 
their replica groups, PBFT systems gain resilience to a 
greater number of Byzantine faults (e.g., from one fault 
per 4 replicas, to four faults per 13 replicas). However, 
their throughput does not increase, as each replica server 
must still execute every request. On the other hand, 
Prophecy’s throughput can benefit from larger groups, as 
it can load balance fast reads over more replica servers. 
As the sketcher can become a bottleneck in the system at 
higher read rates, we used two sketchers for a 7-replica 
group and three sketchers for a 10- and 13-replica group. 

Figure 13: Throughput of Prophecy and pr-PBFT-ro with
varying replica group sizes.


Figure 13 shows the throughput of proxied systems for 
increasing group sizes. Prophecy’s throughput is 395%, 
739%, 1000%, and 1264% that of pr-PBFT-ro, for group
sizes of 4, 7, 10, and 13 replicas, respectively. Prophecy 
does not achieve such a significant throughput improve- 
ment when experiencing transitions, however. We see 
that a 15% transition ratio prevents Prophecy from han- 
dling more than 32,000 req/s, which it achieves with a 
replica group of size 10. Thus, under moderate transition 
rates, further increasing the replica group size will only 
increase fault tolerance, not throughput. 


8 Measurement Study of Alexa Sites


The performance savings of Prophecy are most pro- 
nounced in read-mostly workloads, such as those involv- 
ing DNS: of the 40K names queried by the ConfiDNS 
system [52], 95.6% of them returned the same set of IP 
addresses every time over the course of one day. In web 
services, it is less clear that transitions are rare, given the 
pervasiveness of so-called “dynamic content”.

To investigate this dynamism, we collected data from 
the Alexa top 25 websites by scripting a Firefox browser 
to reload the main page of each site every 20 seconds 
for 24 hours on Dec. 29, 2008. Among the top sites 
were www.youtube.com, www.facebook.com,
www.skyrock.com, www.yahoo.co.jp, and
www.ebay.com.³ The browser loads and executes
all embedded objects and scripts, including embedded 
links, JavaScript, and Flash, with caching disabled. We 
captured all network traffic using the tcpflow utility [19], 
and then ran our HTTP parser and SHA-1-based sketch- 
ing algorithm to build a compact history of requests and 
responses, similar to the real sketcher. 

Our measurement results show that transitions are rare
in most of the downloaded data. We demonstrate a clear
divide between very static and very dynamic data, and
use Rabin fingerprinting [55] to characterize the dynamic
data. Finally, we isolate the results of individual geographic
“sites” using a CIDR prefix database.


³While one might argue that BFT agreement is overkill for many
of the sites in our study, our examples in the introduction show that
Heisenbugs and one-off misconfigurations can lead to embarrassing,
high-profile events. Prophecy protects against these mishaps without
the performance penalty normally associated with BFT agreement.

Figure 14: A CDF of requests over transition ratios.

Figure 15: A CDF over transition ratios of first-party vs.
third-party URLs.


8.1 Frequency of Transitions 


For each unique URL requested during the experiment, 
we measured the ratio of state transitions over repeated 
requests. Figure 14 shows a CDF of unique URLs at 
different transition ratios. We separately plotted those 
URLs based on the number of requests sent to each one, 
given that embedded links generate a variable number 
of requests to some sites. (Where not specified, the 
minimum number of requests used is 25.) We see that 
roughly 50% of all data accessed is purely static, and 
about 90% of all requests have fewer than 15% state tran- 
sitions. These numbers confirmed our belief that most 
dynamic websites are actually dynamic compositions of 
very static content. The same graph scaled by the av- 
erage response size of each request yields very similar 
curves (omitted), suggesting that Figure 14 also reflects 
the total response throughput at each transition ratio. 
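
The per-URL transition ratio can be computed from a request log with a short Python sketch. The log format, the function names, and the exact definition used here (the number of times a response differs from the previous response to the same URL, divided by the number of repeated requests) are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def transition_ratios(log, min_requests: int = 2):
    """log: iterable of (url, response_body) pairs in request order."""
    last = {}
    counts = defaultdict(int)
    transitions = defaultdict(int)
    for url, body in log:
        digest = hashlib.sha1(body).digest()
        if url in last and last[url] != digest:
            transitions[url] += 1  # the response changed since the previous request
        last[url] = digest
        counts[url] += 1
    threshold = max(min_requests, 2)  # a ratio needs at least one repeated request
    return {u: transitions[u] / (n - 1) for u, n in counts.items() if n >= threshold}

def fraction_at_or_below(ratios, threshold: float) -> float:
    """One point of the CDF: share of URLs whose transition ratio is <= threshold."""
    values = list(ratios.values())
    if not values:
        return 0.0
    return sum(v <= threshold for v in values) / len(values)
```

Evaluating `fraction_at_or_below` over a range of thresholds yields a curve of the same kind as the CDFs discussed in this section.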

Figure 15 is the same plot as Figure 14 but divided 
into first-party URLs, or those targeted at an Alexa top 
website, and third-party URLs, or those targeted at other 
sites (given that first-party sites can embed links to other 
domains for image hosting, analytics, advertising, etc.). 
The graph shows that third-party content is much more 
static than first-party content, and thus third-party con- 
tent providers like CDNs and advertisers could benefit 
substantially from Prophecy. 


The results in this section are conservative for two rea- 
sons. First, they reflect a workload of only three requests 
per minute per site, when in reality there may be tens or 
hundreds of thousands of requests per minute. Second, 
many URLs—though not enough to cause space prob- 
lems in a real history table—saw only a few requests, but 
returned identical responses, suggesting that our HTTP 
parser was conservative in parsing them as unique URLs. 
An important characteristic of all of the graphs in this 
section is the relatively flat line across the middle: this 
suggests that most data is either very static or very dy- 
namic. The next section discusses dynamic data in more 
detail. 


8.2 Characterizing Dynamic Data 


Dynamic data degrades the performance of Prophecy be- 
cause it causes failed fast reads to be resent as repli- 
cated reads. Often, however, the amount of dynamism 
is small and may even be avoidable. To investigate this, 
we characterized the dynamism in our data by using Rabin
fingerprinting to efficiently compare responses on either
side of a transition. We divided each response into
chunks of size 1K in expectation [47], or a minimum of 
20 chunks for small requests. 

Our measurements indicate that 50% of all transitions 
differ in at least 30% of their chunks, and about 13% 
differ in all of their chunks. Interestingly, the edit dis- 
tance of these transitions was much smaller: we deter- 
mined that 43% of all transitions differ by a single con- 
tiguous insertion, deletion, or replacement of chunks, 
while preserving at least half or no more than doubling 
the number of original chunks. By studying transitions 
with low edit distance, we can identify sources of dy- 
namism that may be refactorable. For example, a prelim- 
inary analysis of around 4,000 of these transitions (se- 
lected randomly) revealed that over half of them were 
caused by load-balancing directives (e.g., a number ap- 
pended to an image server name) and random identifiers 
(e.g., client IDs) placed in embedded links or parame- 
ters to JavaScript functions. In fact, most of the top-level 
pages we downloaded, including seemingly static pages 
like www.google.com, were highly dynamic for this 
exact reason. A more in-depth analysis is slated for fu- 
ture work. 
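
The chunk-level comparison can be illustrated with content-defined chunking. The Python sketch below uses a simple polynomial rolling hash as a stand-in for Rabin fingerprints; it is not the implementation used in the study. The window size, base, modulus, and the definition of the changed fraction are assumptions, and the minimum of 20 chunks for small responses is not modeled.

```python
import hashlib

WINDOW = 48          # bytes covered by the rolling hash (assumed)
BASE = 257
MOD = (1 << 61) - 1  # large prime modulus

def chunks(data: bytes, avg_size: int = 1024):
    """Split data at content-defined boundaries; chunks average about avg_size bytes."""
    out, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the byte leaving the window
        h = (h * BASE + byte) % MOD                 # add the byte entering the window
        if i + 1 >= WINDOW and h % avg_size == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def changed_fraction(before: bytes, after: bytes) -> float:
    """Fraction of chunks in `after` that do not appear anywhere in `before`."""
    old = {hashlib.sha1(c).digest() for c in chunks(before)}
    new = chunks(after)
    if not new:
        return 0.0
    return sum(hashlib.sha1(c).digest() not in old for c in new) / len(new)
```

Because boundaries depend only on the bytes inside the window, a small insertion disturbs the chunks around the edit and leaves the rest unchanged, which is what makes low-edit-distance transitions easy to detect.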


8.3 Site-Based Analysis


A “site” represents a physical datacenter or cluster of machines
in the same geographic location. A single site
may host large services or multiple services. Having
demonstrated Prophecy’s ability to scale out in such environments,
we now study the potential benefit of deploying
Prophecy at the sites in our collected data. To
organize our data into geographic sites, we used forward
and reverse DNS lookups on each requested URL and
matched the resulting IP addresses against a CIDR prefix
database. (This database, derived from data supplied
by Quova [54], included over 2 million distinct prefixes,
and is thus significantly finer-grained than those provided
by Route Views [57].) Requests that mapped to the same
CIDR prefix were considered to be part of the same site.
Figure 16 shows an overlay of the transition plots of each
site. The figure shows that a few sites serve only very static
or only very dynamic data, but most sites serve a mix
of very static and very dynamic data. All but one site
(view.atdmt.com) show a clear divide between very
static and very dynamic data.

Figure 16: A CDF of URLs over transition ratios for all
sites for which CIDR data was available.
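
Grouping requests into sites by CIDR prefix can be sketched with Python’s standard `ipaddress` module. The prefixes below are reserved documentation ranges used purely as placeholders, the class name is hypothetical, and the use of longest-prefix matching when prefixes overlap is an assumption.

```python
import ipaddress

class SiteIndex:
    def __init__(self, prefixes):
        # Most specific prefixes first, so that the first match is the longest one.
        self._nets = sorted(
            (ipaddress.ip_network(p) for p in prefixes),
            key=lambda net: net.prefixlen,
            reverse=True,
        )

    def site_of(self, ip: str):
        """Return the matching CIDR prefix as a string, or None if no prefix matches."""
        addr = ipaddress.ip_address(ip)
        for net in self._nets:
            if addr.version == net.version and addr in net:
                return str(net)
        return None
```

A linear scan is adequate for a handful of prefixes; a database with millions of prefixes would call for a trie or a similar structure.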


9 Related Work 


A large body of work has focused on providing strong 
consistency and availability in distributed systems. In the 
fail-stop model, state machine replication typically used 
primary copies and view change algorithms to improve 
performance and recover from failures [41, 50]. Quo- 
rum systems focused on tradeoffs between overlapping 
read and write sets [26, 31]. These protocols have been 
extended to malicious settings, including Byzantine fault-tolerant
replicated state machines [12, 42, 56], Byzantine
quorum systems [1, 46], and hybrids of the two [17].
Modern approaches have optimized performance via var- 
ious techniques, including by separating agreement from 
execution [67], using optimistic server-side speculation 
on correct operation [38], reducing replication costs 
by optimizing failure-free operation [65], and allow- 
ing concurrent execution of independent operations [37]. 
Prophecy’s history table is motivated by the same as- 
sumption as this last approach—namely, that many op- 
erations/objects are independent and hence often remain 
static over time. 

Given the perceived cost of achieving strong consis- 
tency and a particular desire to provide “always-on” 
write availability, even in the face of partitions, a num- 
ber of systems opted for cheaper techniques. Several 
BFT replicated state machine protocols were designed 
with weaker consistency semantics, such as BFT2F [44], 
which weakens linearizability to fork* consistency, and
Zeno [59], which weakens linearizability to eventual 
consistency. Several filesystems were designed in a sim- 
ilar vein, such as SUNDR [45] and systems designed 
for disconnected [29, 35] or partially-connected oper- 
ation [51]. BASE [53] explored eventual consistency 
with high scalability and partition tolerance; the foil 
to database ACID properties. More recently, highly- 
scalable storage systems being built out within data- 
centers have also opted for cheaper consistency tech- 
niques, including the Google File System [25], Yahoo!’s 
PNUTS [16], Amazon’s Dynamo [18], Facebook’s Cas- 
sandra [20], eBay’s storage techniques [61], or the popu- 
lar approach of using Memcached [23] with a backend 
relational database. These systems take this approach 
partly because they view stronger consistency properties 
as infeasible given their performance (throughput) costs; 
Prophecy argues that this tradeoff is not necessary for 
read-mostly workloads. 


Recently, several works have explored the use of 
trusted primitives to cope with Byzantine behavior. 
A2M [13] prevents faulty nodes from lying inconsis- 
tently by using a trusted append-only memory primi- 
tive, and TrInc [43] uses a trusted hardware primitive 
to achieve the same goal. Chun et al. [14] introduced a 
lightweight BFT protocol for multi-core single-machine 
environments that runs a trusted coordinator on one core, 
similar in philosophy to Prophecy’s approach of extend- 
ing the trusted computing base to include the sketcher. 

Prophecy is unique in its application to customer- 
facing Internet services and its ability to load-balance 
read requests across a replica group while retaining good 
consistency semantics. Perhaps closest to Prophecy’s se- 
mantics is the PNUTS system [16], which supports a 
load-balanced read primitive that satisfies timeline con- 
sistency (all copies of a record share a common timeline 
and only move forward on that timeline). Delay-once lin- 
earizability is strictly stronger than timeline consistency, 
however, because it does not allow a client to see a copy 
of a record that is more stale than a copy the client has 
already seen (whereas timeline consistency does). 


There has been some work on using history as a con- 
sistency or security metric for particular applications. 
Aiyer et al. [4, 5] develop k-quorum systems that bound 
the staleness of a read request to one of the last k writ- 
ten values. Using Prophecy with a k-quorum system 
may be synergistic: Prophecy’s load-balanced reads are 
less costly than quorum reads, and k-quorum systems 
can protect against an adversarial scheduler that attempts 
to hamper Prophecy’s load balancing. The Farsite file 
system [3] uses historical sketches to validate read re- 
quests, but requires a lease-based invalidation protocol 
to keep sketches strongly consistent. The system modi- 
fies clients extensively and requires knowledge of causal 
dependencies, violating Prophecy’s design constraints (if 
these constraints are ignored, then D-Prophecy can easily 
be modified to achieve the same consistency as Farsite). 
Pretty Good BGP [34] whitelists BGP advertisements 
whose new route to a prefix includes its previous orig- 
inating AS, while other routes require manual inspec- 
tion. ConfiDNS [52] uses both agreement and history to 
make DNS resolution more robust. It requires results to 
be static for a number of days and agreed upon by some 
number of recursive DNS resolvers. Perspectives [63] 
combines history and agreement in a similar way to ver- 
ify the self-signed certificates of SSH or SSL hosts on 
first contact. Prophecy can be viewed as a framework 
that leverages history and agreement in a general man- 
ner. 


10 Conclusions 


Prophecy leverages history to improve the throughput of 
Internet services by expanding the trusted middlebox be- 
tween clients and a service replica group, while provid- 
ing a consistency model that is very promising for many 
applications. D-Prophecy achieves the same benefits for 
more traditional fault-tolerant services. Our prototype 
implementations of Prophecy and D-Prophecy easily in- 
tegrate with PBFT replica groups and are demonstra- 
bly useful in scale-out topologies. Performance results 
show that Prophecy achieves 372% of the throughput of 
even the read optimized PBFT system, and scales lin- 
early as the number of sketchers increases. Our evalua- 
tion demonstrates the need to consider a variety of work- 
loads, not just null workloads as typically done in the lit- 
erature. Finally, our measurement study of the Internet’s 
most popular websites demonstrates that a read-mostly 
workload is applicable to web service scenarios. 


Abstract 


We address the problem of collecting unique items in a large 
stream of information in the context of Intrusion Prevention 
Systems (IPSs). IPSs detect attacks at gigabit speeds and 
must log infected source IP addresses for remediation or 
forensics. An attack with millions of infected sources can 
result in hundreds of millions of log records when counting 
duplicates. If logging speeds are much slower than packet 
arrival rates and memory in the IPS is limited, scalable log- 
ging is a technical challenge. After showing that naive ap- 
proaches will not suffice, we solve the problem with a new 
algorithm we call Carousel. Carousel randomly partitions 
the set of sources into groups that can be logged without du- 
plicates, and then cycles through the set of possible groups. 
We prove that Carousel collects almost all infected sources 
with high probability in close to optimal time as long as 
infected sources keep transmitting. We describe details of 
a Snort implementation and a hardware design. Simula- 
tions with worm propagation models show up to a factor 
of 10 improvement in collection times for practical scenar- 
ios. Our technique applies to any logging problem with non- 
cooperative sources as long as the information to be logged 
appears repeatedly. 


1 Introduction 


With a variety of networking devices reporting events at in- 
creasingly higher speeds, how can a network manager ob- 
tain a coherent and succinct view of this deluge of data? 
The classical approach uses a sample of traffic to make be- 
havioral inferences. However, in many contexts the goal 
is complete or near-complete collection of information — 
MAC addresses on a LAN, infected computers, or mem- 
bers of a botnet. While our paper presents a solution to this 
abstract logging problem, we ground and motivate our ap- 
proach in the context of Intrusion Prevention Systems. 
Originally, Intrusion Detection Systems (IDSs) imple- 
mented in software worked at low speeds, but modern In- 
trusion Prevention Systems (IPSs) such as the Tipping Point 
Core Controller and the Juniper IDP 8200 [5] are imple- 
mented in hardware at 10 Gbps and are standard in many 
organizations. IPSs have also moved from being located 
only at the periphery of the organizational network to be- 
ing placed throughout the organization. This allows IPSs 
to defend against internal attacks and provides finer granu- 
larity containment of infections. Widespread, cost-effective 
deployment of IPSs, however, requires using streamlined 
hardware, especially if the hardware is to be integrated into 
routers (as done by Cisco and Juniper) to further reduce 
packaging costs. By streamlined hardware, we mean ide- 
ally a single chip implementation (or a single board with 
few chips) and small amounts of high-speed memory (less 
than 10 Mbit). 

Figure 1 depicts a logical model of an IPS for the pur- 
poses of this paper. A bad packet arrives carrying some 
key. Typically the key is simply the source address, but other 
fields such as the destination address may also be used. For 
the rest of the paper we assume the key is the IP source ad- 
dress. (We assume the source information is not forged. Any 
attack that requires the victim to reply cannot use a forged 
source address.) The packet is coalesced with other pack- 
ets for the same flow if it is a TCP packet, normalized [16] 
to guard against evasions, and then checked for whether 
the packet is indicative of an attack. The most common 
check is signature-based (e.g., Snort [13]) which determines 
whether the packet content matches a regular expression in 
a database of known attacks. However, the check could also 
be behavior-based. For example, a denial of service attack 
to a destination may be detected by some state accumulated 
across a set of past packets. 

In either case, the bad packet is typically dropped, but the 
IPS is required to log the relevant information on disk at a re- 
mote management console for later analysis and reporting. 
The information sent is typically the key K plus a report 
indicating the detected attack. Earlier work has shown tech- 
niques for high speed implementations of reassembly [4], 
normalization [15, 16], and fast regular expression match- 
ing (e.g., [12]). However, to the best of our knowledge, 
there is no prior work in scalable logging for IPS systems 
or networking. 

Figure 1: IPS logical model including a logging component 
that is often implemented naively 

To see why logging may be a bottleneck, consider 
Figure 2, which depicts a physical model of a streamlined 
hardware IPS implementation, either stand-alone or pack- 
aged in a router line card. Packets arrive at high speed 
(say 10 Gbps) and are passed from a MAC chip to one or 
more IDS chips that implement detection by, for example, 
signature matching. A standard logging facility, such as in 
Snort, logs a report each time the source sends a packet that 
matches an attack signature and writes it to a memory buffer, 
from which it is written out later either to locally attached 
disk in software implementations or to a remote disk at a 
management station in hardware implementations. A prob- 
lem arises because the logging speed is often much slower 
than the bandwidth of the network link. Logging speeds 
less than 100 Mbps are not uncommon, especially in 10 
Gbps IDS line cards attached to routers. Logging speeds are 
limited by physical considerations such as control proces- 
sor speeds and disk bandwidths. While logging speeds can 
theoretically be increased by striping across multiple disks 
or using a network service, the increased costs may not be 
justified in practice. 

In hardware implementations where the memory buffer is 
necessarily small for cost considerations, the memory can 
fill during a large attack and newly arriving logged records 
may be dropped. A typical current configuration might in- 
clude only 20 Mbits of on-chip high speed SRAM of which 
the normalizer itself can take 18 Mbits [16]. Thus, we as- 
sume that the logger may be allocated only a small amount 
of high speed memory, say 1 Mbit. Note that the memory 
buffer may include duplicate records already in the buffer or 
previously sent to the remote device. 

Under a standard naive implementation, unless the log- 
ging rate matches the arrival rate of packets, there is no 
guarantee that all infected sources will be logged. It is easy 
to construct worst-case timing patterns where some set of 
sources A are never logged because another set of sources B 
always reaches the IDS before sources in the set A and fills 
the memory. Even in a random arrival model, intuitively as 
more and more sources are logged, it gets less and less prob- 
able that a new unique source will be logged. In Section 3 
we use a standard analysis based on the coupon collector’s 
problem (e.g., [8]) to show that, even with a fairly optimistic 
random model, the expected time to collect all N sources 
is a multiplicative factor of ln N worse than the op- 
timal time. For example, when N is in the millions, which 
is not unusual for a large worm, the expected time to col- 
lect all sources can be 15 times larger than optimal. We 
also show similar poor behavior of the naive implementa- 
tion, both through analysis and simulation, in more complex 
settings. 

Figure 2: IPS hardware model in which we propose adding 
a scalable logger facility called Carousel. Carousel focuses 
on a small random subset of the set of keys at one time, 
thereby matching the available logging speed. 

The main contribution of this paper, as shown in Figure 2, 
is a scalable logger module that interposes between the de- 
tection logic and the memory buffer. We refer to this module 
and the underlying algorithm as Carousel, for reasons that 
will become apparent. Our logger is scalable in that it can 
collect almost all N sources with high probability with very 
small memory buffers in close to optimal time, where the 
optimal time is N/b, with b being the logging speed. Fur- 
ther, Carousel is simple to implement in hardware even at 
very high speeds, adding only a few operations to the main 
processing path. We have implemented Carousel in software, 
both in Snort and in simulation, in order to evaluate its 
performance. 

While we focus on the scalable logging problem for IPSs 
in this paper, we emphasize that the problem is a general 
one that can arise in a number of measurement settings. For 
example, suppose a network monitor placed in the core of an 
organizational network wishes to log all the IP sources that 
are using the TCP Selective Acknowledgment (SACK) option. 
In general, our mechanism applies to any monitoring setting 
where a source is identified by a predicate on a packet (e.g., 
the packet contains the SACK_PERMITTED option, or the 
packet matches the Slammer signature), memory is limited, 
and sources do not cooperate with the logging process. It 
does, however, require sources to keep transmitting packets 
that satisfy the predicate in order to be logged. Thus Carousel 
does not guarantee the logging of one-time events. 

Figure 3: Abstract logging model: N keys to be logged en- 
ter the logging device repeatedly at a speed B that is much 
greater than the logging speed b and in a potentially adver- 
sarial timing pattern. At the same time, the amount of mem- 
ory M is much less than N, the number of distinct keys to 
be logged. Source cooperation is not assumed. 

The rest of the paper is organized as follows. In Section 2 
we describe a simple abstract model of the scalable logging 
problem that applies to many settings. In Section 3 we de- 
scribe a simple analytical model that shows that even with 
an optimistic random model of packet arrivals, naive log- 
ging can incur a multiplicative penalty of ln N in collection 
times. Indeed, we show this is the case even if naive logging 
is enhanced with a Bloom filter in the straightforward way. 
In Section 4 we describe our new scalable logging algorithm 
Carousel, and in Section 5 we describe our Snort implemen- 
tation. We evaluate Carousel using a simulator in Section 6 
and using a Snort implementation in Section 7. Our eval- 
uation tests both the setting of our basic analytical model, 
which assumes that all sources are sending at time 0, and 
a more realistic logistic worm propagation model, in which 
sources are infected gradually. Section 8 describes related 
work while Section 9 concludes the paper. 


2 Model 


The model shown in Figure 3 abstracts the scalable logging 
problem. First, there are N distinct keys that arrive repeat- 
edly and with arbitrary timing frequency at a cumulative 
speed of B keys per second at the logger. There are two 
resources that are in scarce supply at the logger. First, there 
is a limited logging speed b (keys per second) that is much 
smaller than the bandwidth B at which keys arrive. Even 
this might not be problematic if the logger had a memory M 
large enough to hold all N distinct keys that needed to 
be logged (using methods we discuss below, such as Bloom 
filters [1, 3], to handle duplicates), but in our setting of large 
infections and hardware with limited memory, we must also 
assume that N >> M. 

Eliminating all duplicates before transmitting to the sink 
is not a goal of a scalable logger. We assume that the sink 
has a hash table large enough to store all N unique sources 
(in contrast to the logger) and can eliminate duplicates. 

Instead, the ultimate goal of the scalable logger is near- 
complete collection: the logging of all N sources. We now 
adopt some of the terminology of competitive analysis [2] to 
describe the performance of practical logger systems. The 
best possible logging time $T_{optimal}$ for an omniscient algo- 
rithm is clearly N/b. We compare our algorithms against 
this omniscient algorithm as follows. 


Definition 2.1 We say that a logging algorithm is $(\epsilon, c)$- 
scalable if the time to collect at least $(1-\epsilon)N$ of the sources 
is at most $c\,T_{optimal}$. In the case of a randomized algo- 
rithm, we say that an algorithm is $(\epsilon, c)$-scalable if in time 
$c\,T_{optimal}$ the expected number of sources collected is at 
least $(1-\epsilon)N$. 

Note that in the case $\epsilon = 0$ all sources are collected. While 
obviously collecting all sources is a desirable feature, some 
relaxation of this requirement can naturally lead to much 
simpler algorithms. 
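
As an illustration, the deterministic form of Definition 2.1 can be checked against a recorded trace of logging times. The sketch below is not from the paper: the function names and the trace format (one first-log time per distinct source) are assumptions made for this example.

```python
import math


def optimal_time(n_sources, log_rate):
    """Best possible collection time N/b for an omniscient logger."""
    return n_sources / log_rate


def is_scalable(first_log_times, n_sources, log_rate, eps, c):
    """Check the deterministic (eps, c)-scalability condition on one trace.

    first_log_times holds, for each distinct source that was logged,
    the time at which it was logged for the first time.
    """
    needed = math.ceil((1 - eps) * n_sources)
    if needed <= 0:
        return True
    if len(first_log_times) < needed:
        # Fewer than (1 - eps) * N sources were ever collected.
        return False
    time_to_reach = sorted(first_log_times)[needed - 1]
    return time_to_reach <= c * optimal_time(n_sources, log_rate)
```

For a randomized algorithm, the same check would be applied to the expected number of sources collected by time $c\,T_{optimal}$, for example by averaging over many simulated runs.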

These definitions have some room for play. We could in- 
stead call a randomized algorithm $(\epsilon, c)$-scalable if the ex- 
pected time to collect at least $(1-\epsilon)N$ sources is at most 
$c\,T_{optimal}$, and we may be concerned only with asymptotic 
algorithmic performance as either or both of N/M and B/b 
grow large. As our focus here is on practically efficient al- 
gorithms rather than subtle differences in the definitions, we 
avoid such concerns where the meaning is clear. 

The main goal of this paper is to provide an effective and 
practical $(\epsilon, c)$-scalable randomized algorithm. To empha- 
size the value of this result, we first show that simple naive 
approaches are not $(\epsilon, c)$-scalable for any constants $\epsilon, c > 0$. 
Our positive results will require the following additional as- 
sumption for our model: 

Persistent Source Assumption: We assume that any dis- 
tinct key X to be logged will keep arriving at the logger. 

For sources infected by worms this assumption is of- 
ten reasonable until the source is “disinfected” because the 
source continues to attempt to infect other computers. The 
time for remediation (days) is also larger than the period 
in which the attack reaches its maximum intensity (hours). 
Further, if a source is no longer infected, then perhaps it mat- 
ters less that the source is not logged. In fact, we conjecture 
that no algorithm can solve the scalable logging problem 
without the Persistent Source assumption. 

The abstract logger model is a general one and applies to 
other settings. In the introduction, we mentioned one other 
possibility, logging sources using SACK. As another exam- 
ple, imagine a monitor that wishes to log all the sources in 
a network. The monitor issues a broadcast request to all 
sources asking them to send a reply with their ID. Such mes- 
sages do exist, for example the SYSID message in 802.1. 
Unfortunately, if all sources reply at the same time, some 
set of sources can consistently be lost. 

Of course, if the sources could randomize their replies, 
then better guarantees can be made. The problem can be 
viewed as one of congestion control: matching the speed 
of arrival of logged keys to the logging speed. Congestion 
control can be solved by standard methods like TCP slow 
start or Ethernet backoff if sources can be assumed to co- 
operate. However, in a security setting we cannot assume 
that sources will cooperate, and other approaches, such as 
the one we provide, are needed. 


3 Analysis of a Naive Logger 


3.1 The Naive Logger Alone 


Before we describe our scalable logger and Snort imple- 
mentation, we present a straw man naive logger, and a theo- 
retical analysis of the expected and worst-case times. The 
theoretical analysis makes some simplifications that only 
benefit the naive logger, but still its performance is poor. 
The naive logger motivates our approach. 

We start with a model of the naive logger shown in 
Figure 4. We assume that the naive logger only has a mem- 
ory buffer in the form of a queue. Keys, which again are 
usually source addresses, arrive at a rate of B per second. 
When the naive logger receives a key, it is placed at the tail 
of the queue. If the queue is full, the key is dropped. The 
size of the queue is M. Periodically, at a smaller rate of b 
keys per second, the naive logger sends the key (and any as- 
sociated report) at the head of the queue to a disk log. Let 
$L_D$ denote the set of keys logged to disk, and $L_M$ the set of 
keys that are in the memory. 

The naive logger works very poorly in an adversarial set- 
ting. In an adversarial model, after the queue is full of M 
keys, and when an empty slot opens up at the tail, the adver- 
sary picks a duplicate key that is part of the M keys already 
logged. When the queue is full, the adversary cycles through 
the remaining unique sources to pick them to arrive and be 
dropped, thus fulfilling the persistent source assumption in 
which every source must arrive periodically. It is then easy 
to see the following result. 


Theorem 3.1 Worst-case time for naive logger: The 
worst-case time to collect all N keys is infinite. In fact, the 
worst-case time to collect more than M keys is infinite. 
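
The adversarial schedule behind Theorem 3.1 can be replayed in a few lines. The following Python sketch is only an illustration under simplifying assumptions: the class and function names are hypothetical, keys are integers, and one cycle consists of all unlogged sources arriving while the queue is full, one departure to disk, and one duplicate arrival.

```python
from collections import deque


class NaiveLogger:
    """Bounded FIFO queue in front of a slow disk log (illustrative model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # keys waiting in memory (L_M)
        self.disk = set()      # keys already written to disk (L_D)

    def arrive(self, key):
        """Enqueue a key at the tail, or drop it if the queue is full."""
        if len(self.queue) < self.capacity:
            self.queue.append(key)
            return True
        return False

    def drain_one(self):
        """Move the key at the head of the queue to the disk log."""
        if self.queue:
            self.disk.add(self.queue.popleft())


def adversarial_run(n_keys, capacity, cycles):
    """Replay the adversarial arrival pattern for a number of cycles."""
    logger = NaiveLogger(capacity)
    for key in range(capacity):            # the first M keys fill the queue
        logger.arrive(key)
    for cycle in range(cycles):
        for key in range(capacity, n_keys):
            logger.arrive(key)             # queue is full, so these are dropped
        logger.drain_one()                 # one slot opens at the tail
        logger.arrive(cycle % capacity)    # adversary inserts a duplicate
    return logger
```

However many cycles are run, the union of the disk log and the queue never grows beyond the first M keys, even though every source keeps arriving as the persistent source assumption requires.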


We believe the adversarial models can occur in real situ- 
ations especially in a security setting. Sources can be syn- 
chronized by design or accident so that certain sources al- 
ways transmit at certain times when the logger buffers are 
full. While we believe that resilience to adversarial models 
is one of the strengths of Carousel, we will show that even in 
the most optimistic random models, Carousel significantly 
outperforms a naive logger. 

The simplest random model for key arrival is one in which 
the next key to arrive is randomly chosen from the N possi- 
ble keys, and we can find the expected collection time of the 
naive logger in this setting. 

Figure 4: Model of naive logging using an optimistic ran- 
dom model. When space opens up in the memory log, a 
source is picked uniformly at random from the set of all 
N possible sources. Unfortunately, that source may already 
be in the memory log ($L_M$) or in the disk log ($L_D$). Thus as 
more sources are logged it gets increasingly less probable that 
a new unique source will be logged, leading to a logarithmic 
increase in collection time over optimal. 

Figure 5: Portion of timeline for random model shown in 
Figure 4. We divide time into cycles of time T, where T 
is the time to send one piece of logged information at the 
logging rate b. The time for a new randomly chosen source 
to first arrive, t = 1/B, is much smaller, where B is the faster 
packet arrival rate. 


Let us assume that M < B/b, so that initially the queue 
fills entirely before the first departure. (The analysis is eas- 
ily modified if this is not the case.) Figure 5 is a timeline 
which shows that the dynamics of the system evolve in cy- 
cles of length T seconds, where T = 1/b. Every T seconds 
the current head of the memory queue leaves for the disk 
log, and within the smaller time t = 1/B, a new randomly 
selected key arrives to the tail of the queue. In other words, 
the queue will always be full except when a key leaves from 
the head, leaving a single empty slot at the tail as shown 
in Figure 4. The very next key to be selected will then be 
chosen to fill that empty slot as shown in Figure 5. 
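
The cycle structure described above is easy to simulate. The sketch below is a simplified illustration, not the authors' simulator: it assumes exactly one uniformly random key is admitted per cycle of length T = 1/b, ignores the initial filling of the queue and the delay between memory and disk, and uses a hypothetical function name.

```python
import random


def simulate_naive_random(n_keys, log_rate, seed=0):
    """Time until every key has been admitted at least once.

    Each cycle of length 1/log_rate admits one key chosen uniformly
    at random from the n_keys possible sources.
    """
    rng = random.Random(seed)
    seen = set()
    cycles = 0
    while len(seen) < n_keys:
        seen.add(rng.randrange(n_keys))
        cycles += 1
    return cycles / log_rate
```

Dividing the returned time by N/b gives the slowdown relative to the omniscient logger; for moderately large N the ratio comes out at several times the optimum because most late cycles admit duplicates.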

The analysis of this naive setting now follows from a stan- 
dard analysis of the coupon collector’s problem [8]. Let 
$L = L_M \cup L_D$ denote the set of unique keys logged in 
either memory or disk. Let $T_i$ denote the time for L to grow 
from size i - 1 to i (in other words, the time for the i-th 
new key to be logged). If we optimistically assume that the 
first M keys that arrive are distinct, we have $T_i = T$ for 
$1 \le i \le M$, as the queue initially fills. Subsequently, since 
the newly arriving key is chosen randomly from the set of N 
keys, it will get increasingly probable (as i gets larger) that 
the chosen key already belongs to the logged set L. 

The probability that a new key will not be a duplicate of 
the i - 1 previously logged keys is $P_i = (N - i + 1)/N$. If 
a key is a duplicate the naive logger simply wastes a cycle 
of time T. (Technically, it might be T - t where t = 1/B, 
but this distinction is not meaningful and we ignore it.) The 
expected number of cycles before the i-th key is not a dupli- 
cate is the reciprocal of the probability, or $1/P_i$. Hence for 
$M < i \le N$ the expected value of $T_i$ is 

$$E[T_i] = \frac{N}{b(N - i + 1)}.$$


Using the linearity of expectation, the collection time for 
the last N - M keys is 

$$\sum_{i=M+1}^{N} \frac{N}{b(N - i + 1)} = \frac{N}{b} \sum_{j=1}^{N-M} \frac{1}{j} = \frac{N}{b}\left(\ln(N - M) + O(1)\right),$$

using the well-known result for the sum of the harmonic 
series. Hence if we let $T^{naive}$ be the time to collect all N 
keys for the naive collector, then $T^{naive} > \frac{N}{b}\ln(N - M)$, 
and so the naive logger is a multiplicative factor of ln(N - 
M) worse than the optimal algorithm. 
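
To get a feel for the size of this factor, the expectation can be evaluated exactly for concrete parameters. The Python sketch below is illustrative: the function names are assumptions, and it adopts the same optimistic assumption as the analysis, namely that the first M keys are distinct and take one cycle each.

```python
import math


def expected_naive_time(n_keys, mem_keys, log_rate):
    """Expected collection time of the naive logger in the random model."""
    fill = mem_keys / log_rate  # first M keys, one cycle of length 1/b each
    tail = sum(
        n_keys / (log_rate * (n_keys - i + 1))
        for i in range(mem_keys + 1, n_keys + 1)
    )
    return fill + tail


def slowdown(n_keys, mem_keys, log_rate):
    """Ratio of the expected naive time to the optimal time N/b."""
    return expected_naive_time(n_keys, mem_keys, log_rate) / (n_keys / log_rate)
```

For example, `slowdown(1_000_000, 10_000, 1.0)` can be compared with `math.log(990_000)` to see how closely the exact sum tracks the logarithmic bound.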

It might be objected that it is not clear that N/b is in 
fact the optimal time in this random model, and that this 
ln N factor is due entirely to the embedded coupon collec- 
tor’s problem arising from the random model. For exam- 
ple, if B = b = 1, then you cannot collect the N keys in 
time N, since they will not all appear until after approxi- 
mately N ln N keys have passed [8]. However, as long as 
B/b > ln N (and M > 1), for any y > 0, with high proba- 
bility an omniscient algorithm will be able to collect all keys 
after at most (1 + y) NB/b keys have passed in this random 
model, so the optimal collection time can be made arbitrar- 
ily close to N/b. Hence, this algorithm is indeed not truly 
scalable in the sense we desire, namely in a comparison with 
the optimal omniscient algorithm. 

Even if we seek only to obtain $(1-\epsilon)N$ keys, by the same 
argument the collection time is 

$$\frac{N}{b}\left(\ln((1-\epsilon)N - M) + O(1)\right).$$

Hence when M = o(N), the logger is still not $(\epsilon, c)$- 
scalable for any constants $\epsilon$ and c. We can summarize the 
result as follows: 


Theorem 3.2 Expected time for naive logger: The ex- 
pected time to collect $(1-\epsilon)N$ keys is at least a multiplica- 
tive factor of $\ln((1-\epsilon)N - M)$ worse than the optimal time 
for sufficiently large N, M, and ratios B/b. 

As stated in the introduction, for large worm outbreaks, 
the naive logger can be prohibitively slow. For example, 
as ln 1,000,000 is almost 14, if the optimal time to log 1 
million sources is 1 hour, the naive logger will take almost 
14 hours. 

The results for the random model can be extended to situ- 
ations that naturally occur in practice and fall somewhere 
between the random model and an adversarial model. For 
example, suppose that we have two sets of sources, of sizes 
$N_1$ and $N_2$, but the first set sends at a speed that is j 
times the second. This captures, at a high level, the issue that 
sources may be sending at different rates. We assume each 
source individually behaves according to the random model. 
Let $T_1$ be the expected time to collect all the keys in the fast 
set, and $T_2$ the expected time for the slow set. Then clearly the 
expected time to collect all sources is at least $\max(T_1, T_2)$, 
and indeed this lower bound will be quite tight when $T_1$ and 
$T_2$ are not close. As an example, suppose $N_1 = N_2 = N/2$ 
and j > 1. Then $T_2$ is approximately 

$$\frac{N(j+1)}{2b} \ln\left(\frac{N}{2} - \frac{M}{j+1}\right).$$

The time to collect in this case is dominated by the slow 
sources, and is still a logarithmic factor from optimal. 
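
The effect of unequal rates can be checked with a short calculation. The sketch below is an illustration under simplifying assumptions rather than the paper's derivation: it ignores the memory buffer, assumes one random key is admitted per cycle of length 1/b, and uses hypothetical names. With two equal-sized sets and a fast set that sends `rate_ratio` times as often, an admitted key belongs to the slow set with probability 1/(rate_ratio + 1), so collecting the slow half is a coupon collector's problem slowed down by that factor.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def expected_slow_set_time(n_keys, rate_ratio, log_rate):
    """Expected time to collect every key in the slow half of the sources.

    Assumes two sets of n_keys // 2 sources each, with the fast set
    sending rate_ratio times as often as the slow set.
    """
    half = n_keys // 2
    return (rate_ratio + 1) * half * harmonic(half) / log_rate
```

Comparing the result with the optimal time N/b shows the slow set alone costing roughly (rate_ratio + 1)/2 times a logarithmic factor.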


3.2 The Naive Logger with a Bloom Filter 

A possible objection is that our naive logger is far too 
naive. It may be apparent to many readers that additional 
data structures, such as a Bloom filter, could be used to pre- 
vent logging duplicate sources and improve performance. 
This is true, and we shall use such measures in our scalable 
approaches. However, we point out that because the Bloom filter 
is of limited size, it cannot by itself prevent the problems of 
the naive logger, as we now explain. 

To frame the discussion, consider 1 million infected 
sources that keep sending to an IPS. The solution to the 
problem may appear simple. First, since all the sources may 
arrive at a very fast rate of B before even a few are logged, 
the scheme must have a memory buffer that can hold keys 
waiting to be logged. Second, we need a method of avoiding 
sending duplicates to the logger, specifically one that takes 
small space, in order to make efficient use of the small speed 
of the logger. 

To avoid sending duplicates, one naturally would think 
of a solution based on Bloom filters or hashed fingerprints. 
(We assume familiarity with Bloom filters, a simple small- 
space randomized data structure for answering queries of 
the form “Is this an item in set X” for a given set X. See [3] 
for details.) For example, we could employ a Bloom filter 
as follows. For concreteness, assume that a source address 
is 32 bits, the report associated with a source is 68 bits, and 
that we use a Bloom filter [1] of 10 bits per source.¹ Thus we 
need a total of 100 bits of memory for each source waiting to 
be logged, and 10 bits for each source that has been logged. 
(Instead of a Bloom filter, we could keep a table of hash- 
based fingerprints of the sources, with different tradeoffs but 
similar results, as we discuss in Section 4.2.2.) 


¹This is optimistic because many algorithms would require not just a 
Bloom filter but instead a counting Bloom filter [7] to support deletions, 
which would require more than 10 bits per entry. 

Unfortunately, the memory buffer and Bloom filter have 
to operate at Gigabit speeds. Assume that the amount of 
IDS high speed memory is limited to storing, say, 1 Mbit. 
Then, assuming 100 bits per source, the IPS can only store 
information about a burst of 10,000 sources pending their 
transmission to a remote disk. This does not include the size 
of the Bloom filter, which can only store around 100,000 
sources if scaled to 1 Mbit of size; after this point, the false 
positive rate starts increasing significantly. In practice one 
has to share the memory between the sources and the Bloom 
filter. 
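
The arithmetic behind these capacity figures is sketched below. This is an illustrative calculation only: it assumes 1 Mbit means 10^6 bits, the helper names are hypothetical, and the false positive estimate uses the standard Bloom filter approximation with an assumed seven hash functions, which the text does not specify.

```python
import math

MBIT = 1_000_000  # bits; assumes decimal megabits


def pending_capacity(memory_bits, key_bits=32, report_bits=68):
    """Number of (source, report) records that fit in the memory buffer."""
    return memory_bits // (key_bits + report_bits)


def bloom_capacity(memory_bits, bits_per_key=10):
    """Number of sources a Bloom filter holds at a given bits-per-key budget."""
    return memory_bits // bits_per_key


def bloom_false_positive_rate(memory_bits, n_keys, n_hashes=7):
    """Standard approximation (1 - e^(-k n / m))^k of the false positive rate."""
    return (1.0 - math.exp(-n_hashes * n_keys / memory_bits)) ** n_hashes
```

With these assumptions, a 1 Mbit filter holding 100,000 sources has a false positive rate below one percent, while doubling the number of stored sources pushes the rate above ten percent, which is the kind of degradation the text refers to.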

The inclination would be to clear the Bloom filter after it 
became full and start a second phase of logging. One con- 
cern is that timing synchronization could result in the same 
sources that were logged in phase 1 being logged and filling 
up the Bloom filter again, and this could happen repeatedly, 
leading to missing several sources. Even without this poten- 
tial problem, there is danger in using a Bloom filter, as we 
can see by again considering the random model. 

Consider enhancing the naive logger with a Bloom filter 
to prevent the sending of duplicates. We assume the Bloom 
filter has a counter to track the number of items placed in 
the filter, and the filter is cleared when the counter reaches 
a threshold F to prevent too many false positives. Between 
each clearing, we obtain a group of F distinct random keys, 
but keys may appear in multiple groups. Effectively, this 
generalizes the naive logger, which simply used groups of 
size F = 1. 

Not surprisingly, this variation of the coupon collector’s 
problem has been studied; it is known as the coupon sub- 
set collection problem, and exact results for the problem 
are known [11, 14]. Details can be examined by the in- 
terested reader. A simple analysis, however, shows that 
for reasonable filter sizes F, there will be little or no gain 
over the naive logger. Specifically, suppose $F = o(\sqrt{N})$. 
Then in the random model, the well-known birthday para- 
dox implies that with high probability the first F keys to 
be placed in the Bloom filter will be distinct. While there 
may still be false positives from the Bloom filter, for such 
F the filter fills without detecting any true duplicates with 
high probability. Hence, in the random case, the expected 
collection time even using a Bloom filter of this size is still 
$\frac{N}{b}(\ln(N - M) + O(1))$. With larger filters, some true du- 
plicates will be suppressed, but one needs very large filters 
to obtain a noticeable gain. The essential point of this ar- 
gument remains true even in the setting considered above 
where different sets of sources arrive at different speeds. 
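
The birthday-paradox step can be checked numerically. The following is a minimal sketch, assuming the random model in which each arriving key is drawn uniformly from N sources; the function name `prob_all_distinct` and the example values of N and F are illustrative choices, not taken from the analysis above.

```python
def prob_all_distinct(F, N):
    """Probability that F keys drawn uniformly at random from N are all distinct."""
    p = 1.0
    for i in range(F):
        p *= 1.0 - i / N
    return p


N = 10**6                      # hypothetical population size
for F in (100, 1000, 10000):   # F well below, equal to, and well above sqrt(N)
    print(F, prob_all_distinct(F, N))
```

For F well below $\sqrt{N}$ the probability is close to 1, so the filter fills before it sees a true duplicate; only when F grows well past $\sqrt{N}$ does the filter begin to suppress duplicates in noticeable numbers.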

The key problem here is that we cannot supply the IDS 
with the list of all the sources that have been logged, even 
using a Bloom filter or a hashed set of fingerprints. Indeed, 
when $M \ll N$ no data structure can track a meaningful 
fraction of the keys that have already been stored to disk. 
Our solution to this problem is to partition the population of 
keys to be recorded into subsets of the right size, so that the 
logger can handle each subset without problem. The log- 
ger then iterates through all subsets in phases, as we now 
describe. This repeated cycling through the keys is reminis- 
cent of a Carousel, yielding our name for our algorithm. 


4 Scalable logging using Carousel 


4.1 Partitioning and logging 

Our goal is to partition the keys into subsets of the right 
size, so that during each phase we can concentrate on a sin- 
gle subset. The question is how to perform the partitioning. 
We want the size of each partition to be the right size for 
our logger memory, that is, approximately size M. We suggest 
using a randomized partition of the sources into subsets 
using a hash function that uses very little memory and processing. 
This randomized partitioning would be simple if 
we initially knew the population size N, but that generally 
will not be the case; our system must find the current population 
size N, and indeed should react as the population size 
changes. 

We choose a hash-based partition scheme that is particularly 
memory and time-efficient. Let H(X) be a hash function 
that maps a source key X to an r-bit integer. Let $H_k(X)$ 
be the lower order k bits of H(X). The size of the partition 
can be controlled by adjusting k. 
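
As an illustration of $H_k(X)$, here is a minimal sketch. It assumes SHA-256 truncated to r bits as a stand-in for H; the paper does not prescribe a hash function at this point, and the helper names `H`, `H_k` and `partition_sizes` are hypothetical.

```python
import hashlib


def H(key, r=32):
    """Stand-in r-bit hash of a source key (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << r) - 1)


def H_k(key, k):
    """Lower order k bits of H(key); k = 0 puts every key in one partition."""
    return H(key) & ((1 << k) - 1)


def partition_sizes(keys, k):
    """Number of keys that fall into each of the 2**k partitions."""
    sizes = [0] * (1 << k)
    for key in keys:
        sizes[H_k(key, k)] += 1
    return sizes


keys = [f"192.168.{i // 256}.{i % 256}" for i in range(10000)]
print(partition_sizes(keys, 2))   # four subsets of roughly 2500 keys each
```

Because $H_{k+1}(X)$ extends $H_k(X)$ by one bit, increasing k splits every existing subset in two rather than reshuffling the keys.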

For example, if k = 1, we divide the sources into two 
subsets, one subset whose low order bit (after hashing) is 1, 
and one whose low order bit is 0. If the hash function is 
well-behaved, these two sets will be approximately half the 
original size N. Similarly, k = 2 partitions the sources approximately 
into four equally sized subsets whose hash values 
have low order bits 00, 01, 10, and 11 respectively. This 
allows only very coarse-grained partitioning, but that is gen- 
erally suitable for our purposes, and the simplicity of using 
the lower order k bits of H(X) is particularly compelling 
for implementation and analysis. To begin we will assume 
the population size is stable but unknown, in which case the 
basic Carousel algorithm can be outlined as follows: 


• Partition: Partition the population into $2^k$ groups 
by placing all sources which have the same value of 
$H_k(X)$ in the same partition. 

• Iterate: A phase is assigned time $T_{phase} = M/b$, which 
is the time to log M sources, where M is the available 
memory in keys and b is the logging rate. The 
i-th phase is defined by logging only sources s such that 
$H_k(s) = i$. Other sources are automatically dropped 
during this phase. The algorithm must also utilize 
some means of preventing the same source from being 
logged multiple times in the phase, such as a Bloom 
filter or hash fingerprints. 

• Monitor: If during phase i, the number of keys that 
match $H_k(s) = i$ exceeds a high threshold, then we 
return to the Partition step and increase k. While our 
algorithms typically use k = k + 1, higher jumps can 
allow faster response. If the number of keys 
that match $H_k(s) = i$ falls below a low threshold, then 
we return to the Partition step and decrease k. 


In other words, Carousel initially tries to log all sources 
without hash partitioning. If that fails because of mem- 
ory overflow, the algorithm then works on half the possi- 
ble sources in a phase. If that fails, it works on a quarter 
of the possible sources, and so on. Once it determines the 
appropriate partition size, the algorithm iterates through all 
subsets to log all sources. 

As described, we could in the monitoring stage change 
k by more than 1 if our estimate of the number of keys 
seen during that phase suggests that would be an appropri- 
ate choice. Also, of course, we can choose to decrease k 
if our estimate of the keys in that phase is quite small, as 
would happen if we are logging suspected virus sources and 
these sources are stopped. There are many variations and 
optimizations we could make, and some will be explored in 
our experiments. The important idea of Carousel, however, 
is to partition the set of keys to match the logger memory 
size, updating the partition as needed. 


4.2 Collection Times for Carousel 

We assume that the memory includes, for each key to 
be recorded, the space for the key itself, the correspond- 
ing report, and some number of bits for a Bloom filter. 
This requires slightly more memory space than we assumed 
when analyzing the random model, where we did not use 
the Bloom filter. The discrepancy is small, as we expect 
the Bloom filter to be less than 10% of the total memory 
space (on the order of 10 bits or less per item, against 100 
or more bits for the key and report). This would not effec- 
tively change the lower bounds on performance of the naive 
logger. We generally ignore the issue henceforth; it should 
be understood that the Bloom filter takes a small amount of 
additional space. 

Recall that Carousel has 3 components: partition, iterate, 
and monitor. Faced with an unknown population N, the 
scalable logger will keep increasing the number of bits chosen, 
k, until each subset is less than size M, the memory size 
available for buffering logged keys. 

We sketch an optimistic analysis, and then correct for the 
optimistic assumptions. Let us assume that all N keys are 
present at the start of time, that our hash function splits the 
keys perfectly equally, and that there is no failed recording 
of keys due to false positives from the Bloom filter (or whatever 
structure suppresses duplicates). In that case it will take 
at most $\lceil \log_2 (N/M) \rceil$ partition steps for Carousel to get the right 
number of subsets. Each such step requires the time for a single 
logging phase, $T_{phase} = M/b$. The logger then reaches 
the right subset size, so that k is the smallest value such that 
$N/2^k \le M$. The collector then goes through $2^k$ phases to 
collect all N sources. Note that $2^k < 2N/M$, or else k 
would not be the smallest value with $N/2^k \le M$. Hence, 
after the initial phases to find the right value of k, the additional 
collection time required is $2^k \cdot M/b < 2N/b$, or at most a factor of 
two more than optimal. The total time is thus at most 

$$\frac{M \lceil \log_2 (N/M) \rceil}{b} + \frac{2N}{b},$$

and generally the second term will dominate the first. 
Asymptotically, when $N \gg M$, we are roughly within a 
factor of 2 of the optimal collection time. 
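
The bound is easy to evaluate for concrete parameters. The sketch below assumes the idealized setting of this analysis (equal-sized subsets, no false positives); the function names are illustrative, and the example values N = 10000, M = 500 and b = 100 are the ones used later in the simulations.

```python
import math


def smallest_k(N, M):
    """Smallest k such that N / 2**k <= M (perfectly equal split assumed)."""
    k = 0
    while N / 2**k > M:
        k += 1
    return k


def collection_time_bound(N, M, b):
    """Upper bound M * ceil(log2(N/M)) / b + 2N / b on the total time."""
    steps = math.ceil(math.log2(N / M)) if N > M else 0
    return M * steps / b + 2 * N / b


N, M, b = 10000, 500, 100
k = smallest_k(N, M)
print(k, 2**k * M / b, collection_time_bound(N, M, b), N / b)
```

With these values k = 5, one pass over the $2^k = 32$ subsets takes 160 seconds, the bound is 225 seconds, and the optimal time N/b is 100 seconds.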

Note that the factor of 2 in the 2N/b term could in fact be 
replaced in theory by any constant a > 1, by increasing the 
number of sets in the partition by a factor of a rather than 
2 at each partition step. This would increase the number 
of partition steps to $\lceil \log_a (N/M) \rceil$. In practice we would not 
want to choose a value of a too close to 1, because keys will 
not be partitioned equally into sets, as we describe in the 
next subsection. Also, as we have described, a factor of 2 is 
convenient in terms of partitioning via the low order bits of 
a hash. In what follows we continue to use the factor 2 in 
describing our algorithm, although it should be understood 
smaller constants (with other tradeoffs) are possible. 

In some ways our analysis is actually pessimistic. Early 
phases that fail can still log some items, and we have as- 
sumed that we could partition to require 2N/M phases, 
when generally the number of phases required will be 
smaller. However, we have also made some optimistic as- 
sumptions that we now revisit more carefully. 

4.2.1 Unequal Partitioning: Maximum Subset Analysis 

If the logger uses k bits to partition keys, then there are 
$K = 2^k$ subsets. While the expected number of sources in 
a subset is N/K, even assuming a perfectly random hash function, 
there may be deviations in the set sizes. Our algorithm 
will actually choose the value of k such that the biggest partition 
fits in our memory budget M, not the average partition, 
and we need to take this into account. That is, we need 
to analyze the maximum number of keys being assigned to a 
subset at each phase interval. 

In general, this can be handled using standard Chernoff 
bound analysis [8]. In this specific case, for example, [10] 
proves that with very high probability, the maximum number 
of sources in any subset is less than 

$$\frac{N}{K} + \sqrt{\frac{2N \ln K}{K}}.$$

Therefore we can assume that the smallest integer k satisfying 

$$\frac{N}{K} + \sqrt{\frac{2N \ln K}{K}} \le M, \qquad (1)$$

where $K = 2^k$, is greater than or equal to the k eventually 
found by the algorithm. 
Note that the difference between our optimistic analysis, 
where we required the smallest k such that $N/K \le M$, and 
this analysis is generally very small, as $\sqrt{2N \ln K / K}$ is generally 
much less than N/K. That is, suppose that $N/K \le M$, 
but $\frac{N}{K} + \sqrt{\frac{2N \ln K}{K}} > M$, so that at some point we might 
increase the value k to more than the smallest value such 
that $N/K \le M$, because we unluckily have a subset in our 
partition that is bigger than the memory size. The key here 
is that in this case $N/K \approx M$, or more specifically 

$$\frac{N}{K} > M - \sqrt{\frac{2N \ln K}{K}},$$

so that our collection time is now 

$$\frac{2KM}{b} \le \frac{2N}{b} + \frac{2K}{b}\sqrt{\frac{2N \ln K}{K}}.$$

That is, the collection time is still, at most, very close to 
2N/b, with the addition of a smaller order term that contributes 
negligibly compared to 2N/b for large N. Hence, 
asymptotically, we are still within a factor of c of the optimal 
collection time, for any c > 2. 
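
To see when the maximum-subset bound changes the choice of k, the following sketch evaluates the bound $N/K + \sqrt{2N \ln K / K}$ directly. The function names and the example population N = 15000 are hypothetical; only the formula comes from the analysis above.

```python
import math


def max_subset_bound(N, k):
    """High-probability bound on the largest of K = 2**k subsets."""
    K = 2**k
    return N / K + math.sqrt(2 * N * math.log(K) / K)


def smallest_safe_k(N, M):
    """Smallest k for which the bound fits in the memory budget M."""
    k = 0
    while max_subset_bound(N, k) > M:
        k += 1
    return k


print(smallest_safe_k(10000, 500), max_subset_bound(10000, 5))
print(smallest_safe_k(15000, 500), max_subset_bound(15000, 5))
```

For N = 10000 and M = 500 both analyses give k = 5. For N = 15000 the average subset with k = 5 (about 469 keys) fits in memory, but the bound on the largest subset does not, so this conservative rule moves to k = 6.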

4.2.2 Effects of False Positives 

So far, our analysis has not taken into account our method 
of suppressing duplicates. One natural approach is to use 
a Bloom filter, in which case false positives can lead to 
a source not being logged in a particular phase. This explains 
our definition of an ($\epsilon$, c)-scalable logger. We have 
already seen that c can be upper bounded by any number 
larger than 2 asymptotically. Here $\epsilon$ can be bounded by 
the false positive rate of the corresponding Bloom filter. As 
long as the number of elements per phase is no more than 
$M' = \frac{N}{K} + \sqrt{\frac{2N \ln K}{K}}$ with high probability, then given 
the number of bits used for our Bloom filter, we can bound 
the false positive rate. For example, using 10M' bits in the 
Bloom filter, the false positive rate is less than 1%, so our 
logger asymptotically converges to a (0.01, 2)-scalable logger. 

We make note of some additions one can make to improve 
the analysis. First, this analysis assumes only a single major 
cycle that logs each subset in the partition once. If one 
rerandomized the chosen hash functions each major cycle, 
then the probability a persistent source is missed each major 
cycle is independently at most $\epsilon$ each time. Hence, after two 
such cycles, the probability of a source being missed is at 
most $\epsilon^2$, and so on. 

Second, this analysis is pessimistic, in that in this setting, 
items are gradually added to an empty Bloom filter each 
phase; the Bloom filter is not in its full state at all times, 
so the false positive probability bound for the full filter is a 
large overestimate. For completeness we offer the following 
more refined analysis (which is standard) to obtain the 
expected false positive rate. (As usual, the actual rate is concentrated 
around its expectation with high probability.) 

Assume the Bloom filter has m bits and uses h hash functions. 
Consider whether the (i + 1)st item added to the filter 
causes a false positive. First consider a particular bit 
in the Bloom filter. The probability that it is not set to 
1 by one of the hi hash functions thus far is $(1 - \frac{1}{m})^{hi}$. 
Therefore the probability of a false positive at this stage is 
$(1 - (1 - \frac{1}{m})^{hi})^h \approx (1 - e^{-hi/m})^h$. 

Suppose M' items are added into the Bloom filter within 
a phase interval. The expected fraction of false positives is 
then (approximately) $\frac{1}{M'} \sum_{i=0}^{M'-1} (1 - e^{-hi/m})^h$, compared to the 
$(1 - e^{-hM'/m})^h$ given by the standard analysis for the false 
positive rate after M' elements have been added. As an example, 
with M' = 312, h = 5, and m = 5000, the standard 
analysis gives a false positive rate of $1.4 \cdot 10^{-3}$, while our 
improved analysis gives a false positive rate of $2.5 \cdot 10^{-4}$. 
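
The two rates in the example can be reproduced with a few lines of code. This is a minimal sketch that uses the exponential approximation $(1 - e^{-hi/m})^h$ for the per-item false positive probability; the function names are illustrative.

```python
import math


def standard_fp(m, h, n):
    """False positive rate of a Bloom filter that already holds n items."""
    return (1 - math.exp(-h * n / m)) ** h


def refined_fp(m, h, n):
    """Average false positive rate while the filter fills from 0 to n items."""
    return sum((1 - math.exp(-h * i / m)) ** h for i in range(n)) / n


m, h, n = 5000, 5, 312
print(standard_fp(m, h, n))   # about 1.4e-3
print(refined_fp(m, h, n))    # about 2.5e-4
```

The averaged rate is lower by roughly a factor of h + 1 = 6, as one would expect when the per-item probability grows approximately like $(hi/m)^h$ while the filter is lightly loaded.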


Third, if collecting all or nearly all sources is truly 
paramount, instead of using a Bloom filter, one can use 
hash-based fingerprints of the sources instead. This requires 
more space than a Bloom filter (O(log M') bits per source 
if there are M' sources per phase) but can reduce the probability of 
a false positive to inverse polynomial in M'; that is, with 
high probability, all sources can be collected. We omit the 
standard analysis. 

4.2.3 Carousel and Dynamic Adaptation 

Under our persistent source assumption, any distinct key 
keeps arriving at the logger. In fact, for our algorithm as 
described, we need an even stronger assumption: each key 
must appear during the phase in which it is recorded, which 
means each key should arrive every N/b steps. Keys that 
do not appear this frequently may miss their phase and not 
be recorded. In most settings, we do not expect this to be 
a problem; any key that does not persist and appear this 
frequently does not likely represent a problematic source 
in terms of, for example, virus outbreaks. Our algorithm 
could be modified for this situation in various ways, which 
we leave as future work. One approach, for example, would 
be to sample keys in order to estimate the 95th percentile 
of the average interarrival times between keys, and set the 
phase time used to gather a subset of keys accordingly. 
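
One way such an estimate could be computed is sketched below. This is not part of the algorithm as evaluated; it assumes a sample of arrival timestamps per key is available, uses a nearest-rank percentile, and all names (`mean_interarrival`, `percentile_nearest_rank`, `suggested_phase_time`) are hypothetical.

```python
import math


def mean_interarrival(timestamps):
    """Average gap between consecutive arrivals of one key."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_phase_time(arrivals_by_key, pct=95):
    """Phase time long enough for pct% of sampled keys to reappear on average."""
    means = [mean_interarrival(ts)
             for ts in arrivals_by_key.values() if len(ts) >= 2]
    return percentile_nearest_rank(means, pct)


sample = {"a": [0, 1, 2, 3], "b": [0, 10, 20], "c": [0, 5]}
print(suggested_phase_time(sample))
```

Keys seen only once in the sample are ignored here, which biases the estimate toward frequent senders; a real deployment would need to decide how to treat them.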

A more pressing issue is that the persistent source as- 
sumption may not hold because external actions may shut 
down infected sources, effectively changing the size of the 
set of keys to record dynamically. For example, during a 
worm outbreak, the number of infected sources rises rapidly 
at first but can then decline due to external actions (for 
example, network congestion, users shutting down slow ma- 
chines due to infection, and firewalling traffic or blocking a 
Figure 6: Flowchart of Carousel within Snort packet flow 


part of the network). In that case, the scalable logger may 
pick a large number of sampling bits k at first due to large 
outbreak traffic. However, the logger should correspondingly 
decrease the value of k subsequently as the number of 
sources to record declines, to avoid inefficient logging based 
on too large a number of phases. 


5 Carousel Implementations 


We describe our Snort evaluation in Section 5.1 and a sketch 
of a hardware implementation in Section 5.2. 


5.1 Snort Implementation 

In this section, we describe our implementation of 
Carousel integrated into the Snort [13] IDS. We need to 
first understand the packet processing flow within Snort to 
see where we can interpose the Carousel scalable logger 
scheme. As in Figure 6, incoming packets are captured by 
libpcap, queued in a kernel buffer, and then processed by 
the callback function ProcessPacket. 

ProcessPacket first passes the packet to preprocessors, 
which are components or plug-ins serving to filter out suspi- 
cious activity and prepare the packet to be further analyzed. 
The detection engine then matches the packet against the 
rules loaded during Snort initialization. Finally, the Snort 


output module performs appropriate actions such as logging 
to files or generating alerts. Note that Snort is designed to 
be strictly single-threaded for multiplatform portability. 
The logical choice is to place the Carousel module between 
the detection engine and output module so that the traffic 
can either go directly to the output plugin or get diverted 
through the Carousel module. We cannot place the logger 
module before the detection engine because we need to log 
only after a rule (e.g., a detected worm) is matched. Sim- 
ilarly, we cannot place the logger after the output module 
because by then it is too late to affect which information is 
logged. Our implementation also allows a rule to bypass 
Carousel if needed and go directly to the output module. 
Figure 6 is a flowchart of the Carousel module for Snort, interposed 
between the detection engine and the output module. 
The module uses the variables $T_{phase} = M/b$ (time for each 
phase) and k (number of sampling bits) described in Section 
4.1. M is the number of keys that can be logged in a 
partition and b is the logging rate; in our experiments we 
use M = 500. The module also uses a 32-bit integer V that 
represents the hash value corresponding to the current partition. 
Initially, k = 0, V = 0, the Bloom filter is empty, and 
a timer T is set to fire after $T_{phase}$. The Bloom filter uses 
5000 bits, or 10 bits per key that can fit in M, and employs 
5 hash functions (SDBM, DJP, DEK, JS, PJW) taken from 
[9]. 
The Carousel scalable logger first compares the low-order 
k bits of the hash of the packet key (we use the IP source 
address in all our experiments) to the low order k bits of V. 
If they do not match, the packet is not in the current partition 
and is not passed to the output logging. If the value matches 
but the key yields a positive from the Bloom filter (so it is 
either already logged, or a false positive), again the packet 
is not passed to the output module. If the value matches and 
the key does not yield a positive from the Bloom filter, then 
the module adds the key to the Bloom filter and the packet 
is passed on to the output module. If the Bloom 
filter overflows (the number of insertions exceeds M), then 
k is incremented by 1, to create smaller size partitions. 
When the timer T expires, a phase ends. We first check 
for underflow by testing whether the number of insertions is 
less than M/x. We found empirically that a factor x = 2.3 
worked well without causing oscillations. (A value slightly 
larger than 2 is sensible, to prevent oscillating because of 
the variance in partition sizes.) If there is no underflow, then 
the sampling value V is increased by 1 mod $2^k$ to move to 
the next partition. 
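
The per-packet and per-timer logic just described can be sketched as follows. This is a toy model, not the Snort module itself. It makes several assumptions that the description above leaves open: SHA-256 replaces the five hash functions from [9], the Bloom filter is cleared at every phase boundary and on overflow, the key that triggers an overflow is not logged, and an underflow decrements k by 1. The class name `CarouselSketch` and its methods are hypothetical.

```python
import hashlib


class CarouselSketch:
    """Toy model of the Carousel module logic (illustrative only)."""

    def __init__(self, M=500, bloom_bits=5000, num_hashes=5, x=2.3):
        self.M, self.m, self.h, self.x = M, bloom_bits, num_hashes, x
        self.k = 0   # number of sampling bits
        self.V = 0   # hash value of the current partition
        self._reset_phase()

    def _reset_phase(self):
        self.bits = bytearray(self.m)   # one byte per filter bit, for clarity
        self.inserted = 0

    def _hash(self, key, salt):
        digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def on_packet(self, key):
        """Return True if the packet should be passed to the output module."""
        mask = (1 << self.k) - 1
        if (self._hash(key, "part") & mask) != (self.V & mask):
            return False                  # not in the current partition
        positions = [self._hash(key, i) % self.m for i in range(self.h)]
        if all(self.bits[p] for p in positions):
            return False                  # already logged, or a false positive
        for p in positions:
            self.bits[p] = 1
        self.inserted += 1
        if self.inserted > self.M:        # overflow: create smaller partitions
            self.k += 1
            self._reset_phase()           # assumption: restart the phase
            return False                  # assumption: drop the overflowing key
        return True

    def on_timer(self):
        """Called when the timer T fires, every T_phase = M / b."""
        if self.k > 0 and self.inserted < self.M / self.x:
            self.k -= 1                   # assumption: underflow decrements k
            self.V &= (1 << self.k) - 1
        else:
            self.V = (self.V + 1) % (1 << self.k)
        self._reset_phase()
```

A caller would invoke `on_packet` for every packet that matches a rule and `on_timer` once per phase; only packets for which `on_packet` returns True reach the output module.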


5.2 Hardware Implementation 

Figure 7 shows a schematic of the base logic that can be 
inserted between the detector and the memory buffer used 
to store log records in an IPS ASIC. Using 1 Mbit for the 
Bloom filter, we estimate that the logic takes less than 5% 
of a low-end 10 mm by 10 mm networking ASIC. All results 
are reported for a standard 400 MHz 65 nm process currently 
being used by networking vendors. The logic is flow-through: 
in other words, it can be inserted between the detector 
and logging logic without changing any other logic. This allows 
the hardware to be incrementally deployed within an 
IPS without changing existing chip sets. 

Figure 7: Schematic of the Carousel Logger logic as part of 
an IPS Chip. 

We assume the detector passes a key (e.g., a source IP ad- 
dress) and a detection record (e.g., signature that matched) 
to the first block. The hash block computes a 64-bit hash 
of the key. Our estimates use a Rabin hash whose loop is 
unrolled to run at 40 Gbps using 20K gates. 

The hash output supplies a 64-bit number which is passed 
to the Compare block. This block masks out the low-order k 
bits of the hash (a simple XOR) and then compares it (com- 
parator) to a register value V that denotes the current hash 
value for this phase. If the comparison fails, the log attempt 
is dropped. If it succeeds, the key and record are passed to 
the Bloom filter logic. This is the most expensive part of the 
logic. Using 1 Mbit of SRAM to store the Bloom filter and 3 
parallel hash functions (these can be found by taking bits 1- 
20, 21-40, 41-60 etc of the first 64-bit hash computed with- 
out any further hash computations), the Bloom filter logic 
takes less than a few percent of a standard ASIC. 
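
In software terms, the Compare block and the reuse of one 64-bit hash for three Bloom filter indices might look like the sketch below. It is an illustration only: it uses a bitwise AND mask for the comparison, whereas the hardware description above phrases the masking differently, and the function names are hypothetical. A 20-bit field addresses $2^{20}$ positions, which matches a 1 Mbit filter.

```python
def in_current_partition(hash64, k, V):
    """Compare the low-order k bits of the hash with those of V."""
    mask = (1 << k) - 1
    return (hash64 & mask) == (V & mask)


def bloom_indices(hash64, num_hashes=3, width=20):
    """Slice consecutive 20-bit fields out of a single 64-bit hash."""
    field = (1 << width) - 1
    return [(hash64 >> (i * width)) & field for i in range(num_hashes)]


h = 1 | (2 << 20) | (3 << 40)
print(bloom_indices(h))                    # the three 20-bit fields of h
print(in_current_partition(0b1011, 2, 0b11))
```

Note that in this sketch the low-order k bits used for partitioning overlap the first Bloom filter field; whether the hardware avoids that overlap is not stated in the text.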

As in the Snort implementation, a periodic timer module 
fires every $T_{phase} = M/b$ time and causes the value V to be 
incremented. Thus the remaining logic other than the Bloom 
filter (and to a smaller extent the hash computation) is very 
small. We use two copies of the Bloom filter and clear one 
copy while the other copy is used in a phase. The Bloom 
filter should be able to store a number of keys equal to the 
number of keys that can be stored in the memory buffer. 
Assuming 10 bits per entry, a 1 Mbit Bloom filter allows ap- 
proximately 100,000 keys to be handled in each phase with 
the targeted false positive probability. Other details (under- 
flow, overflow etc.) are similar to the Snort implementation 
and are not described here. 


6 Simulation Evaluation 


To evaluate Carousel under more realistic settings in which 
the population grows, we simulate the logger behavior when 
faced with a typical worm outbreak as modeled by a logistic 





[Figure 8 plot: number of logged sources versus time (sec), with one curve each for 10,000, 20,000, 40,000 and 80,000 sources.] 

Figure 8: Performance of Carousel with different logging 
populations 


equation. We used a discrete event simulation engine that is 
a stripped down (for efficiency) version of the engine found 
in ns-2. We implement the Carousel scalable logger as de- 
scribed in Section 4. The simulated logger maintains the 
sampling bit count k and only increases k when the Bloom 
filter overflows; k stabilizes when all sources sampled during 
$T_{phase}$ fit into the memory budget M with logging speed 
b. Simulation allows us to investigate the effect of various 
input parameters such as varying worm speed and whether 
the worm uses a hit list. Again, in all the simulations below, 
the Bloom filter uses 5000 bits and 5 hash functions (SDBM, 
DJP, DEK, JS, PJW) taken from [9]. For each experiment, 
we plot the average of 50 runs of simulation. 

We start by confirming the theory with a baseline exper- 
iment in Section 6.1 when all sources are present at time 0. 
We examine the performance of our logger with the logis- 
tic model in Section 6.2. We evaluate the impact of non- 
uniform source arrivals in Section 6.3. In Section 6.4, we 
examine a tradeoff between using a smaller number of bits 
per Bloom filter element and taking more major cycles 
to collect all sources. Finally, in Section 6.5, we demonstrate 
the benefit of reducing k in the presence of worm remediation. 


6.1 Baseline Experiment 

In Figure 8, we verify the underlying theory of Carousel 
in Section 4 assuming all sources are present at time 0. We 
consider various starting populations N = 10000 to 80000 
sources, a memory budget of M = 500 items, and a logging 
speed b = 100 items per second. 

Figure 8 shows that the Carousel scalable logger collects 
almost all (at least 99.9%) items by t = 189, 354, 679 and 
1324 seconds for N = 10000, 20000, 40000 and 80000 respectively. 
This is no more than 2N/b in all cases, matching 
the predictions of our optimistic analysis in Section 4. 
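
The comparison with 2N/b can be made explicit by dividing each reported collection time by the optimal time N/b. The snippet below is a small check that uses only the numbers quoted above; the variable names are illustrative.

```python
b = 100  # logging speed, items per second
reported = {10000: 189, 20000: 354, 40000: 679, 80000: 1324}

# Ratio of the reported collection time to the optimal time N / b.
ratios = {N: t / (N / b) for N, t in reported.items()}
for N, ratio in ratios.items():
    print(N, round(ratio, 2))
```

All four ratios fall between about 1.65 and 1.89, that is, below the factor of 2 given by the optimistic analysis.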

With these settings, the 10,000 sources will be parti- 


tioned into 32 subsets, each of size approximately 312 (in 
expectation). In fact, our experiment trace shows that the 
number of sources per phase is in the range of 280 to 
340. Since the Bloom filter uses 5000 bits, essentially 
we have more than 10 bits per item once the right number 
of partitions is found. As we calculated previously (in 
Section 4.2.2), the accumulated false positive rate of 312 
sources in a 5000-bit Bloom filter with 5 hash functions 
is $2.5 \cdot 10^{-4}$. We also verified that most phases have no 
false positives. However, the Carousel algorithm may need 
additional major cycles to collect the few sources that were 
missed because of false positives. 
Since a major cycle is $2^k$ iterations, the theory predicts that 
Carousel requires more time to collect these missed sources 
for larger k and hence for larger N. We observe that 
the length of the horizontal segment of each curve in Figure 8, 
which represents the collection time of all sources missed in 
the first major cycle, is longer for larger populations N. 


6.2 Logger Performance with Logistic Model 

In the logistic model, a worm is characterized by H, the 
size of the initial hit list, the scanning rate, and a probability 
p of a scan infecting a vulnerable node. In our simulations 
below, we use a population of N = 10,000, a memory size 
M = 500 with Bloom filter and M = 550 without Bloom 
filter, and logging speed b = 100 packets/sec; the best possible 
logging time to collect all sources is N/b = 100 seconds. 

For our first 3 experiments, shown in Figures 9, 10 and 11, 
we use an initial hit list of H = 10,000. Since the hit list 
is the entire population, as in the baseline, all sources are 
infected at time t = 0. We use these simulations to see 
the effect of increasing the scan rate and monitoring abil- 
ity assuming all sources are infected. Our subsequent ex- 
periments will assume a much smaller hit list, more closely 
aligned with a real worm outbreak. 

For the first experiment, shown in Figure 9, we use 6 scans 
per second (to model a worm outbreak that matches the 
Code Red scan rate [17]) and p = 0.01. Figure 9 shows 
that Carousel needs 200 seconds to collect the N = 10,000 
sources whereas the naive logger takes 4,000 seconds. Further, 
the difference between Carousel and the naive logger 
increases with the fraction of sources logged. For example, 
Carousel is 6 times faster at logging 90% of all sources 
but 20 times faster at logging 100% of all sources. This is consistent 
with the analysis in Section 3.1. 

In Figure 10 we keep all the same parameters but increase 
the scan rate ten times to 60 scans/sec. The higher scan rate 
allows naive logging a greater chance to randomly sample 
packets and so the difference between scalable and naive 
logging is less pronounced. Figure 11 uses the same param- 
eters as Figure 9 but assumes that only 50% of the scanning 
packets are seen by the IPS. This models the fact that a given 
IPS may not see all worm traffic. Notice again that the dif- 
ference between naive and Carousel logging decreases when 


the amount of traffic seen by the IPS decreases. 

The remaining simulations assume a logistic model of 
worm growth starting with a hit list of H = 10 infected 
sources when the logging process starts. The innermost 
curve illustrates the infected population versus time, which 
obeys the well-known logistic curve. Even under this prop- 
agation model, Carousel still outperforms naive logging by 
a factor of almost 5. Carousel takes around 400 seconds to 
collect all sources while the naive logger takes 2000 seconds. 

Figure 13 shows a slower worm. A slower worm can be 
modeled in many ways, such as using a smaller initial hit list, 
a lower scan rate, or a lower victim hitting probability. In 
Figure 13, we used a smaller hitting probability of 0.001. 
Intuitively, the faster the propagation dynamics, the better 
the performance of the Carousel scalable logger when compared 
to the naive logger. Thus, for this slower worm, the difference 
is less pronounced. 

Figure 14 demonstrates the scalability of Carousel, as we 
scale up N from 10,000 to 100,000 with all other parameters 
staying the same (i.e., 6 scans per second and p = 0.01). 
Carousel takes around 9,000 seconds to collect all sources, 
while the naive logger takes 40,000 seconds. Note also that 
in all simulations with the logistic model (and indeed in all 
our experiments) the performance of the naive logger with a 
Bloom filter is indistinguishable from that of the naive log- 
ger by itself — as the theory predicts. 


6.3 Non-uniform source arrivals 

In this section, we study logging performance when the 
sources arrive at different rates as described in Section 3.1. 
In particular, we experiment with two equal sets of sources 
in which one set sends ten times as fast as the other set. 
Figure 15b shows the result for the naive logger. We observe 
that the naive logger has a significant problem in logging the 
slow sources, which are responsible for dragging down the 
overall performance. As predicted by our model, the time 
taken to log all slow sources is ten times longer than the time 
taken to log all fast sources. The times to log all and almost 
all sources are 8,000 and 4,000 seconds respectively. 

Simply adding a Bloom filter only slightly increases the 
performance of the naive logger, as predicted by the theory. 
On the other hand, Carousel is able to consistently log all 
sources as shown in Figure 15a. Carousel is not suscepti- 
ble to source arrival rates: sources from both the fast and 
slow sets are logged equally in each minor cycle once the 
appropriate number of sampling bits has been determined. 


6.4 Effect of Changing Hash Functions 


In this section, we study the effect of randomly changing 
the hash functions for the Bloom filter on each major cycle 
(that is, each pass through all of the sets of the partition). Re- 
call that this prevents similar arrival patterns between major 
cycles from causing the same source to be missed repeat- 
edly. 
Figure 9: Performance of the Carousel 
Figure 10: High scan rate (60 scans/s) 
Figure 11: Reduced monitoring space 
Figure 17abc compares the performance in Carousel of 
using fixed hash functions throughout and changing the 
hash functions each major cycle with 1-bit, 5-bit and 10- 
bit Bloom filters respectively. We changed the hash func- 
tions randomly by simply XORing each hash value with a 
new random number after each major cycle. In these ex- 
periments, a major cycle is approximately 160 seconds. For 
the 1-bit results, one can clearly see knees in the curves at 
t = 160, 320, and 480 corresponding to each major cycle in 
which the logger collects sources missed in previous cycles. 
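
The rerandomization step can be sketched as a thin wrapper around an existing hash function. This is an illustration under stated assumptions: `zlib.crc32` stands in for the hash functions actually used, the random value is 32 bits wide, and the class name `RerandomizedHash` and its method `new_major_cycle` are hypothetical.

```python
import random
import zlib


class RerandomizedHash:
    """Wrap a base hash and XOR its value with a per-cycle random number."""

    def __init__(self, base_hash, bits=32, seed=None):
        self.base_hash = base_hash
        self.mask = (1 << bits) - 1
        self.bits = bits
        self.rng = random.Random(seed)
        self.salt = 0   # no rerandomization until the first major cycle ends

    def new_major_cycle(self):
        """Draw a fresh random number after each major cycle."""
        self.salt = self.rng.getrandbits(self.bits)

    def __call__(self, key):
        return (self.base_hash(key) ^ self.salt) & self.mask


def crc(key):
    return zlib.crc32(key.encode())


h = RerandomizedHash(crc, seed=1)
before = h("10.0.0.1")
h.new_major_cycle()
after = h("10.0.0.1")
print(before, after, before ^ after == h.salt)
```

Two keys with the same base hash value still collide after the XOR; what changes from cycle to cycle is which distinct hash values map to the same Bloom filter position once the result is reduced to an index.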

Carousel instrumented with changing hash functions is 
much faster in collecting all sources across several major 
cycles. For example, for the 1-bit case, with changing hash 
functions each major cycle, it takes 1500 seconds to log all 
sources while using fixed hash functions takes 2500 seconds 
to log all sources. 

Should one prefer using a smaller number of bits per 
Bloom filter element and a greater number of major cycles 
or using a larger number of bits per Bloom filter element? This 
depends on the exact goals; for a fixed amount of memory, 
using a smaller number of Bloom filter bits per element al- 
lows the logger to log slightly more keys in every phase at 
the cost of a somewhat increased false positive probability. 
Based on our experiments, we believe using 5 bits per element 
provides excellent performance, although our Snort 
implementation (built before this experiment) currently uses 
10 bits per element. 

6.5 Adaptively Adjusting Sampling Bits 

As described in Section 4.2, an optimization for Carousel 
is to dynamically adapt the number of sampling bits k to 
match the currently active source population. In a worm 
outbreak, the value of k needs to be large when the
population of infected sources is large, but it should be
decreased when the scope of the outbreak declines.

To study this effect, we use the two-factor worm 
model [17] to model the dynamic process of worm propa- 
gation coexisting with worm remediation. The two-factor 
worm model augments the standard worm model with two 
realistic factors: dynamic countermeasures by network ad- 
ministrators/users (such as node immunization and traffic 
firewalls) and additional congestion due to worm traffic that 
makes scan rates reduce when the worm grows. The model 
was validated using measurements of actual Internet worms 
(see [17]). 

In Figure 16, we apply the two-factor worm model. The 
curve labeled “Source dynamics” records the number of in- 
fected sources as time progresses. Observe the exponential 
increase in the number of infected sources prior to t = 100.
However, the infected population then starts to decline.

Figure 16: Dynamic source sampling in Carousel
Figure 17: Comparison of fixed vs. changing hash functions in Carousel

If we let the two-factor model run to completion, the num- 
ber of infected sources will eventually drop to zero, which 
makes logging sources less meaningful. In practice, how- 
ever, it is the logging that makes remediation possible. Thus 
to illustrate the efficacy of using fully adaptive sampling 
within the logger, we only apply the two-factor model until 
the infectious population drops to half of the initial vulnera- 
ble tally. We then look at the time to collect the final infected 
population. Note that a non-decreasing logger will choose a 
sampling factor based on the peak population and thus may 
take unnecessarily long to collect the final population of in- 
fected sources. 

Figure 16 shows that the fully adaptive scheme (increment
k on overflow, decrement on underflow) enhances performance
in terms of logging time and also the capability to
collect more sources before they are immunized. In particular,
the fully adaptive scheme collects almost all sources at
220 seconds while the non-decreasing scheme (only increments
k on overflow, no decrements) takes more than 300
seconds to collect all sources. Examining the simulation
results more closely, we found the non-decreasing scheme
adapted to k = 5 (32 partitions) and stayed there, while the
fully adaptive scheme eventually reduced to k = 4 (16 partitions)
at time t = 130.
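
The difference between the two schemes can be sketched as a per-phase update rule for k. This is only an illustration under stated assumptions: the exact overflow and underflow conditions are defined in Section 4.2, which is not reproduced here, so the `capacity` parameter and the `underflow_fraction` threshold below are hypothetical.

```python
def adapt_sampling_bits(k: int, logged: int, capacity: int,
                        fully_adaptive: bool = True,
                        underflow_fraction: float = 0.25) -> int:
    """Return the number of sampling bits k to use in the next phase.

    logged   -- distinct sources seen in the phase that just ended
    capacity -- number of sources the memory can hold in one phase
    The underflow threshold is an assumed value, not taken from the paper.
    """
    if logged >= capacity:
        return k + 1  # overflow: partitions are too large, split further
    if fully_adaptive and k > 0 and logged < underflow_fraction * capacity:
        return k - 1  # underflow: partitions are too small, merge
    return k          # the non-decreasing scheme never reaches the branch above


def partitions(k: int) -> int:
    """k sampling bits correspond to 2^k partitions."""
    return 1 << k
```

With `fully_adaptive=False` the function models the non-decreasing scheme, which stays at the k chosen for the peak population.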


7 Snort Evaluation 


We evaluate our implementation of Carousel in Snort using 
a testbed of two fast servers (Intel Xeon 2.8 GHz, 8 cores, 
8 GB RAM) connected by a 10 Gbps link. The first server 
sends simulated packets to be logged according to a spec- 
ified model while the second server runs Snort, with and 
without Carousel, to log packets. 

We set the timer period T_phase = 5 seconds. The vulnerable
population is N = 10,000 sources and the memory
buffer has M = 500 entries. In the first experiment, the
pattern of traffic arrival is random: each incoming packet 
is assigned a source that is uniformly and randomly picked 
from the population of N sources. 

Figure 18 shows the logging performance of Snort instru- 
mented with Carousel. Traffic arrives at the rate (B) of 100 
Mbps. All packets have a fixed size of 1000 bytes. The logging
rate is b = 100 events per second, i.e., b ≈ 1 Mbps
and B/b = 100. Figure 18 shows the improvements in logging
from our modifications. Specifically, our scalable implementation
is able to log all sources within 300 seconds
Figure 18: Logging performance of Snort instrumented with
Carousel under a random traffic pattern
Figure 19: Logging performance of Snort instrumented with
Carousel under a periodic traffic pattern


while standard Snort needs 1500 seconds. Also, adding a 
Bloom filter does not significantly improve the performance 
of Snort, matching our previous theory. 

Figure 19 shows the logging performance when the 
sources are perpetually dispatched in a periodic pattern 1,
2, ..., N, 1, 2, ..., N, ... Such highly regular traffic patterns
are common in a number of practical scenarios, such as syn- 
chronized attacks or periodic broadcasts of messages in the 
communication fabric of large distributed systems. We ob- 
serve that the performance of standard Snort degrades by 
one order of magnitude as compared to the random pattern 
shown in Figure 18. Further examination shows that the 
naive logger keeps missing certain sources due to the regu- 
lar timing of the source arrivals. On the other hand, Carousel 
performance remains consistent in this setting. 

We also performed an experiment with two equally sized 
sets of sources arriving at different rates, with fast sources 
arriving at 1 Gbps and slow sources at 100 Mbps, as shown 
in Figure 20. Our observations are consistent with the simulation
results in Section 6.3. Note that in this setting standard
Snort takes about 20 times longer to collect all sources
than Snort with Carousel (6000 seconds versus 300 seconds);
in contrast, standard Snort took only about 5 times longer in
our experiment with random arrivals.


8 Related Work


A number of recent papers have focused on high speed im- 
plementations of IPS devices. These include papers on fast 
reassembly [4], fast normalization [15, 16], and fast regular 
expression matching (e.g., [12]). To the best of our knowledge,
no prior work in network security focuses on the
problem of scalable logging. However, network
managers are not just interested in detecting whether
an attack has occurred but also in determining which of their 
computers is already infected for the purposes of remedia- 
tion and forensics. 

Random partitions whose size is adjusted dynamically
are probably used in other contexts. We have
found a reference to the Alto file system [6], where if the file 
system is too large to fit into memory (but is on disk), then 
the system resorts to a random partition strategy to rebuild 
the file index after a crash. Files are partitioned randomly 
into subsets until the subsets are small enough to fit in main 
memory. While the basic algorithm is similar, there are dif- 
ferences: we have two scarce resources (logging speed and 
memory) while the Alto algorithm only has one (memory). 
We have duplicates while the Alto algorithm has no dupli- 
cate files; we have an analysis, the Alto algorithm has none. 


9 Conclusions 


In the face of internal attacks and the need to isolate parts 
of an organization, IPS devices must be implementable 
cheaply in high speed hardware. IPS devices have success- 
fully tackled hardware reassembly, normalization, and even 
Reg-Ex and behavior matching. However, when an attack
is detected it is also crucial to identify who the attacker
was for potential remediation. While standard IPS devices
can log source information, the slow speed of logging can
result in lost information. We showed a naive logger can
take a multiplicative factor of ln N more time than needed,
where N is the infected population size, for small values of
memory M required for affordable hardware.

We then described the Carousel scalable logger that is 
easy to implement in software or hardware. Carousel col- 
lects nearly all sources, assuming they send persistently, in 
nearly optimal time. While large attacks such as worms and 
DoS attacks may be infrequent, the ability to collect a list of 
infected sources and bots without duplicates and loss seems 
like a useful addition to the repertoire of functions available 
to security managers. 

Figure 20: Snort under non-uniform source arrivals

While we have described Carousel in a security setting,
the ideas apply to other monitoring tasks where the sources
of all packets that match a predicate must be logged in the
face of high incoming speeds, low memory, and small logging
speeds. The situation is akin to congestion control in
networks; the classical solution, as found in say TCP or 
Ethernet, is for sources to reduce their rate. However, a 
passive logger cannot expect the sources to cooperate, es- 
pecially when the sources are attackers. Thus, the Carousel 
scalable logger can be viewed as a form of randomized ad- 
mission control where a random group of sources is admit- 
ted and logged in each phase. Another useful interpretation 
of our work is that while a Bloom filter of size M cannot 
usefully remove duplicates in a population of N >> M, 
the Carousel algorithm provides a way of recycling a small 
Bloom filter in a principled fashion to weed out duplicates 
in a very large population. 
Abstract


We present the design and implementation of a novel 
anti-malware system called SplitScreen. SplitScreen per- 
forms an additional screening step prior to the signa- 
ture matching phase found in existing approaches. The 
screening step filters out most non-infected files (90%) 
and also identifies malware signatures that are not of interest
(99%). The screening step significantly improves
end-to-end performance because safe files are quickly 
identified and are not processed further, and malware 
files can subsequently be scanned using only the signa- 
tures that are necessary. Our approach naturally leads to 
a network-based anti-malware solution in which clients 
only receive signatures they need, not every malware
signature ever created as with current approaches. We 
have implemented SplitScreen as an extension to Cla- 
mAV [13], the most popular open source anti-malware 
software. For the current number of signatures, our implementation
is 2x faster and requires 2x less memory
than the original ClamAV. These gaps widen as the num- 
ber of signatures grows. 


1 Introduction 


The amount of malicious software (malware)—viruses, 
worms, Trojan horses, and the like—is exploding. As 
the amount of malware grows, so does the number of 
signatures used by anti-malware products (also called 
anti-viruses) to detect known malware. In 2008, Syman- 
tec created over 1.6 million new signatures, versus a 
still-boggling six hundred thousand new signatures in 
2007 [2]. The ClamAV open-source anti-malware sys- 
tem similarly shows exponential growth in signatures, as 
shown in Figure 1. Unfortunately, this growth, fueled 
by easy-to-use malware toolkits that automatically cre- 
ate hundreds of unique variants [1, 20], is creating dif- 
ficult system and network scaling problems for current 
signature-based malware defenses. 

There are three scaling challenges. First, the sheer 
number of malware signatures that must be distributed 
to end-hosts is huge. For example, the ClamAV open- 
source product currently serves more than 120 TB of 
signatures per day [14]. Second, current anti-malware 
systems keep all signatures pinned in main memory. Re- 


ducing the size of the pinned-in-memory component is 
important to ensure operation on older systems and re- 
source constrained devices such as netbooks, PDAs or 
smartphones, and also to reduce the impact that malware 
scanning has on other applications running concurrently 
on the same system. Third, the matching algorithms typ- 
ically employed have poor cache utilization, resulting in 
a substantial slowdown when the signature database out- 
grows the L2 and L3 caches. 

We propose SplitScreen, an anti-malware architecture 
designed to address the above challenges. Our design is 
inspired by two studies we performed. First, we found 
that the distribution of malware in the wild is extremely 
biased. For example, only 0.34% of all signatures in 
ClamAV were needed to detect all malware that passed 
through our University’s e-mail gateways over a 4-month
period (§5.2). Of course, for safety, we cannot simply
remove the unmatched signatures since a client must be 
able to match anything in the signature database. Second, 
the performance of current approaches is bottlenecked by 
matching regular expression signatures in general, and 
by cache-misses due to that scanning in particular. Since, 
in existing schemes, the number of cache-misses grows 
rapidly with the total number of signatures, the efficiency 
of existing approaches will significantly degrade as the 
number of signatures continues to grow. Others have 
made similar observations [10]. 

At a high level, SplitScreen divides scanning into two
steps. First, all files are scanned using a small, cache- 
optimized data structure we call a feed-forward Bloom 
filter (FFBF) [18]. The FFBF implements an approxi- 
mate pattern-matching algorithm that has one-sided er- 
ror: it will properly identify all malicious files, but may 
also identify some safe files as malicious. The FFBF out- 
puts: (1) a set of suspect matched files, and (2) a subset 
of signatures from the signature database needed to con- 
firm that suspect files are indeed malicious. SplitScreen 
then rescans the suspect matched files using the subset of 
signatures using an exact pattern matching algorithm. 

The SplitScreen architecture naturally leads to 
a demand-driven, network-based architecture where 
clients download the larger exact signatures only when 
needed in step 2 (SplitScreen still accelerates traditional 
single-host scanning when running the client and the 
server on the same host). For example, SplitScreen requires
55.4 MB of memory to hold the current ~533,000
ClamAV signatures. ClamAV, for the same signatures, 
requires 116 MB of main memory. At 3 million sig- 
natures, SplitScreen can use the same amount of mem- 
ory (55.4 MB), but ClamAV requires 534 MB. Given the 
0.34% hit rate in our study, SplitScreen would down- 
load only 10,200 signatures for step 2 (vs. 3 million). 
Our end-to-end analysis shows that, overall, SplitScreen 
requires less than 10% of the storage space of existing 
schemes, with only 10% of the network volume (§5). We
believe these improvements to be important for two rea- 
sons: (1) SplitScreen can be used to implement malware 
detection on devices with limited storage (e.g., residen- 
tial gateways, mobile and embedded devices), and (2) 
it allows for fast signature updates, which is important 
when counteracting new, fast spreading malware. In ad- 
dition, our architecture preserves clients’ privacy better 
than prior network-based approaches [19]. 

SplitScreen addresses the memory scaling challenge 
because its data structures grow much more slowly than 
in existing approaches (with approximately 11 bytes 
per signature for SplitScreen compared to more than 
170 bytes per signature for ClamAV). Combined with 
a cache-efficient algorithm, this leads to better through- 
put as the number of signatures grows, and represents 
the major advantage of our approach when compared 
to previous work that employed simple Bloom filters 
to speed up malware detection (§5.9 presents a detailed
comparison with HashAV [10]). SplitScreen addresses 
the signature distribution challenges because users only 
download the (small) subset of signatures needed for step 
2. SplitScreen addresses constrained computational de- 
vices because the entire signature database need not fit 
in memory as with existing approaches, as well as hav- 
ing better throughput on lower-end processors. 

Our evaluation shows that SplitScreen is an effective 
anti-malware architecture. In particular, we show: 


• Malware scanning at twice the speed with half
the memory: By adding a cache-efficient pre-screening
phase, SplitScreen improves throughput
by more than 2x while simultaneously requiring
less than half the total memory. These numbers will
improve as the number of signatures increases.

• Scalability: SplitScreen can handle a very large increase
in the number of malware signatures with
only small decreases in performance (35% decrease
in speed for 6x more signatures, §5.4).

• Distributed anti-malware: We developed a novel
distributed anti-malware system that allows clients
to perform fast and memory-inexpensive scans,
while keeping the network traffic very low during
both normal operation and signature updates. Furthermore,
clients maintain their privacy by sending
only information about malware possibly present on
their systems.

• Resource-constrained devices: SplitScreen can
be applied to mobile devices (e.g., smartphones¹),
older computers, netbooks, and similar devices.
We evaluated SplitScreen on a low-power device
similar to an iPhone 3GS. In our experiments,
SplitScreen worked properly even with 3 million
signatures, while ClamAV crashed due to lack of
resources at 2 million signatures.

• Real-World Implementation: We have implemented
our approach in ClamAV, an open-source
anti-malware defense system. Our implementation
is available at http://security.ece.cmu.edu.
We will make the malware data sets used
in this paper available to other researchers upon request.

Figure 1: Number of signatures and cache misses in
ClamAV from April 2005 to March 2009.


2 Background 


2.1 Signature-based Virus Scanning 


Signature-based anti-malware defenses are currently the 
most widely used solutions. While not the only approach 
(e.g., recent proposals for behavior-based detection such 
as [15]), there are two important reasons to continue 
improving signature-based methods. First, they remain 
technically viable today, and form the bedrock of the two 
billion dollar anti-malware industry. More fundamen- 
tally, signature-based techniques are likely to remain an 
important component of anti-malware defenses, even as 
those defenses incorporate additional mechanisms. 

¹Smartphones have many connectivity options, and are able to run
an increasingly wide range of applications (sometimes on open platforms).
We therefore expect that they will be subjected to the same
threats as traditional computers, and they will require the same security
mechanisms.

In the remainder of this section we describe signature-based
malware scanning, using ClamAV [13] as a specific
example. ClamAV is the most popular open-source
anti-malware solution, and already incorporates signif- 
icant optimizations to speed up matching and decrease 
memory consumption. We believe ClamAV to be rep- 
resentative of current malware scanning algorithms, and 
use it as a baseline from which to measure improvements 
due to our techniques. 

During initialization, ClamAV reads a signed signa- 
ture database from disk. The database contains two types 
of signatures: whole file or segment MD5 signatures and 
byte-pattern signatures written in a custom language with 
regular expression-like syntax (although they need not 
have wildcards) which we refer to as regular expression 
signatures (regexs). Figure 1 shows the distribution of
MD5 and regular expression signatures in ClamAV over 
time. Currently 84% of all signatures are MD5 signa- 
tures, and 16% are regular expressions. In our experi- 
ments, however, 95% of the total scanning time is spent 
matching the regex signatures. 

When scanning, ClamAV first performs several pre- 
processing steps (e.g., attempting to unpack and uncom- 
press files), and then checks each input file sequentially 
against the signature database. It compares the MD5 of 
the file with MD5s in the signature database, and checks
whether the file contents match any of the regular expres- 
sions in the signature database. If either check matches a 
known signature, the file is deemed to be malware. 

ClamAV’s regular expression matching engine has 
been significantly optimized over its lifetime. Cla- 
mAV now uses two matching algorithms [16]: Aho- 
Corasick [3] (AC) and Wu-Manber [23] (WM).² The
slower AC is used for regular expression signatures that 
contain wildcard characters, while the faster WM han- 
dles fixed string signatures. 

The AC algorithm builds a trie-like structure from the 
set of regular expression signatures. Matching a file with 
the regular expression signatures corresponds to walking 
nodes in the trie, where transitions between nodes are 
determined by details of the AC algorithm not relevant 
here. Successfully walking the trie from the root to a leaf 
node corresponds to successfully matching a signature, 
while an unsuccessful walk corresponds to not matching 
any signature. A central problem is that a trie constructed 
from a large number of signatures (as in our problem set- 
ting) will not fit in cache. Walks of such tries will typi- 
cally visit nodes in a semi-random fashion, causing many 
cache misses. 

²ClamAV developers refer to this algorithm as extended Boyer-Moore.

The Wu-Manber [23] algorithm for multiple fixed patterns
is a generalization of the single-pattern Boyer-Moore [6]
algorithm. Matching using Wu-Manber entails
hash table lookups, where a failed lookup means
the input does not match a signature. In our setting, ClamAV
uses a sliding window over the input file, where the
bytes in the window are matched against signatures by using
a hash table lookup. Again, if the hash table does not 
fit in cache, each lookup can cause a cache miss. Thus, 
there is a higher probability of cache misses as the size 
of the signature database grows. 
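
The sliding-window lookup described above can be illustrated with a deliberately simplified sketch. It is not Wu-Manber proper (it omits the shift table that lets the real algorithm skip positions) and it is not ClamAV's code; the function names and the choice to index patterns by their first `width` bytes are assumptions made for the example.

```python
from collections import defaultdict


def build_prefix_table(patterns: list[bytes], width: int) -> dict[bytes, list[bytes]]:
    """Index each fixed-string pattern by its first `width` bytes."""
    table: dict[bytes, list[bytes]] = defaultdict(list)
    for pattern in patterns:
        if len(pattern) < width:
            raise ValueError("pattern shorter than window width")
        table[pattern[:width]].append(pattern)
    return table


def scan(data: bytes, table: dict[bytes, list[bytes]], width: int) -> set[bytes]:
    """Slide a window over `data`; a failed lookup rules out a match there."""
    matched: set[bytes] = set()
    for i in range(len(data) - width + 1):
        # One hash table lookup per window position.
        candidates = table.get(data[i:i + width])
        if not candidates:
            continue  # miss: no signature starts at this position
        for pattern in candidates:
            if data.startswith(pattern, i):  # confirm the full pattern
                matched.add(pattern)
    return matched
```

Each window position costs one lookup in `table`, which is why a table larger than the cache translates directly into cache misses during a scan.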


2.2 Bloom Filters 


The techniques we present in this paper make extensive
use of Bloom filters [5]. Consider a set S. A Bloom filter
is a data structure used to implement set membership
tests of S quickly. Bloom filter membership tests may
have one-sided errors. A false positive occurs when the
outcome of the test is x ∈ S when x is not really a member
of S. Bloom filters will never incorrectly report x ∉ S
when x really is in S.


Initialization. Bloom filter initialization takes the set S 
as input. A Bloom filter uses a bit array with m bits, and k 
hash functions to be applied to the items in S. The hashes 
produce integers with values between 1 and m, that are
used as indices in the bit array: the k hash functions are
applied to each element in S, and the bits indexed by the
resulting values are set to 1 (thus, for each element in S,
there will be a maximum of k bits set in the bit array— 
fewer if there are collisions between the hashes). 


Membership test. When doing a set membership test, 
the tested element is hashed using the k functions. If the 
filter bits indexed by the resulting values are all set, the 
element is considered a member of the set. If at least one 
bit is 0, the element is definitely not part of the set.
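
Initialization and the membership test can be sketched in a few lines. This is a minimal illustration rather than the data structure used in our implementation: the class name, the salted SHA-256 hashing, and the use of 0-based indices (0 to m-1 instead of 1 to m) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _indices(self, item: bytes) -> list[int]:
        # Derive k indices from salted SHA-256 digests (illustrative choice).
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + item).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: bytes) -> None:
        # Initialization: set the bits indexed by the k hash values.
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: bytes) -> bool:
        # Membership test: all k bits set means "possibly in the set";
        # any bit equal to 0 means "definitely not in the set".
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(item))
```

Because `add` only ever sets bits, an element that was added can never fail the test, which is the one-sided error property described above.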


Important parameters. The number of hash func- 
tions used and the size of the bit array determine the 
false positive rate of the Bloom filter. If S has |S| elements,
the asymptotic false positive probability of a test
is $(1 - e^{-k|S|/m})^k$ [7]. For a fixed m, $k = \ln 2 \times m/|S|$
minimizes this probability. In practice, however, k is often
chosen smaller than optimum for speed considerations: 
a smaller k means computing a smaller number of hash 
functions and doing fewer accesses to the bit array. In 
addition, the hashing functions used affect performance, 
and when non-uniform, can also increase the false posi- 
tive rate. 
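
The trade-off between k and the false positive rate can be checked numerically. The sketch below evaluates the standard asymptotic expression (1 - e^(-k|S|/m))^k and the standard minimizer k = ln 2 × m/|S|; the function names are ours, and n stands for |S|.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Asymptotic false positive probability (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimizes the rate for fixed m and n."""
    return math.log(2) * m / n
```

For example, with m = 10,000 bits and n = 1,000 elements the minimizer is close to 7, and both a smaller k (such as 3) and a larger k (such as 12) give a higher false positive rate, although the smaller k needs fewer hash computations and bit array accesses per test.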


Scanning text. Text can be efficiently scanned for mul- 
tiple patterns using Bloom filters in the Rabin-Karp [11] 
algorithm. The patterns, all of which must be of the same 
length w, represent the set used to initialize the Bloom 
filter. The text is scanned by sliding a window of length 
w and checking rolling hashes of its content, at every po- 
sition, against the Bloom filter. Exact matching requires 
every Bloom filter hit to be confirmed by running a verifi- 
cation step to weed out Bloom filter false positives (e.g., 
using a subsequent exact pattern matching algorithm). 
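
A compact sketch of this scanning loop follows. It is an illustration under stated assumptions, not our implementation: the polynomial rolling hash, the constants `BASE` and `MOD`, the double-hashing trick used to derive k bit positions from one hash value, and the use of a Python set for the verification step are all choices made for the example.

```python
BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the rolling hash


def window_hash(window: bytes) -> int:
    """Polynomial hash of a window, computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def bit_indices(h: int, m: int, k: int) -> list[int]:
    """Derive k bit positions from one hash value (double hashing)."""
    h1, h2 = h % m, 1 + (h // m) % (m - 1)
    return [(h1 + i * h2) % m for i in range(k)]


def scan_text(text: bytes, patterns: list[bytes],
              m: int = 1 << 16, k: int = 3) -> list[tuple[int, bytes]]:
    """Return (position, pattern) for every exact occurrence in `text`."""
    w = len(patterns[0])
    if any(len(p) != w for p in patterns):
        raise ValueError("all patterns must have the same length w")
    bits = [False] * m
    for p in patterns:  # initialize the Bloom filter from the patterns
        for idx in bit_indices(window_hash(p), m, k):
            bits[idx] = True
    exact = set(patterns)
    hits: list[tuple[int, bytes]] = []
    if len(text) < w:
        return hits
    top = pow(BASE, w - 1, MOD)  # weight of the byte leaving the window
    h = window_hash(text[:w])
    for i in range(len(text) - w + 1):
        if i > 0:
            # Roll the hash: drop text[i-1], append text[i+w-1].
            h = ((h - text[i - 1] * top) * BASE + text[i + w - 1]) % MOD
        if all(bits[idx] for idx in bit_indices(h, m, k)):
            window = text[i:i + w]
            if window in exact:  # verification weeds out false positives
                hits.append((i, window))
    return hits
```

The rolling update makes the per-position cost independent of w, so the Bloom filter probes dominate the work of the scan.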
3 Design


SplitScreen is inspired by several observations. First, 
the number of malware programs is likely to continue 
to grow, and thus the scalability of an anti-malware sys- 
tem is a primary concern. Second, malware is not con- 
fined to high-end systems; we need solutions that protect 
slower systems such as smartphones, old computers, net- 
books, and similar systems. Third, signature-based ap- 
proaches are by far the most widely-used in practice, so 
improvements to signature-based algorithms are likely to 
be widely applicable. Finally, in current signature-based 
systems all users receive all signatures whether they (ul- 
timately) need them or not, which is inefficient. 


3.1 Design Overview 


At a high level, an anti-malware defense has a set of
signatures Σ and a set of files F. For concreteness, in
this section we focus on regular expression signatures
commonly found in anti-malware systems, so we use
Σ to denote a set of regular expressions. We extend
our approach to MD5 signatures in §3.5.1. The goal of
the system is to determine the (possibly empty) subset
F_malware ⊆ F of files that match at least one signature
σ ∈ Σ.


SplitScreen is an anti-malware system, but its approach
differs from existing systems because it does not
perform exact pattern matching on every file in F. Instead,
SplitScreen employs a cache-efficient data structure
called a feed-forward Bloom filter (FFBF) [18] that
we created for doing approximate pattern matching. We
use it in conjunction with the Rabin-Karp text search algorithm
(see §2.2). The crux of the system is that the
cache-efficient first pass has extremely high throughput.
The cache-efficient algorithm is approximate in the sense
that the FFBF scan returns a set of suspect files F_suspect
that is a superset of malware identified by exact pattern
matching, i.e., F_malware ⊆ F_suspect ⊆ F. In the second step
we perform exact pattern matching on F_suspect and return
exactly the set F_malware. Figure 2 illustrates this strategy.
The files in F_suspect \ F_malware represent the false positives
that we refer to in various sections of this paper, and they
are caused by 1) Bloom filter false positives (recall that
Bloom filters have one-sided error) and 2) the fact that
we can only look for fixed-size fragments of signatures
and not entire signatures in the first step (the FFBF scan),
as a consequence of how Rabin-Karp operates.


³To put things in perspective, suppose there is a new Windows virus,
and that the 1 billion computers with Microsoft Windows [4] are all 
running anti-malware software. A typical signature is at least 16 bytes 
(e.g., the size of an MD5). If each computer receives a copy of the 
signature, then that one virus has cost 15,258 MB of disk space world- 
wide to store the signature. 
Figure 2: The SplitScreen scanning architecture.


3.2 High-Level Algorithm 


The scanning algorithm used by SplitScreen consists of 
four processing steps called FFBF-INIT, FFBF-SCAN, 
FFBF-HIT, and FFBF-VERIFY, which behave as fol- 
lows: 


FFBF-INIT(Σ) → φ takes as input the set of signatures
Σ and outputs a bit vector φ which we call the
all-patterns bit vector. FFBF-SCAN will use this
bit vector to construct an FFBF to scan files.

FFBF-SCAN(φ, F) → (φ', F_suspect) constructs an
FFBF from φ and then scans each file f ∈ F
using the FFBF. The algorithm outputs the tuple
(φ', F_suspect) where F_suspect ⊆ F is the list of files
that were matched by φ, and φ' is a bit vector
that identifies the signatures actually matched by
F_suspect. We call φ' the matched-patterns bit vector.

FFBF-HIT(φ', Σ) → Σ' takes in the matched-patterns
bit vector φ' and outputs the set of regexp signatures
Σ' ⊆ Σ that were matched during FFBF-SCAN.

FFBF-VERIFY(Σ', F_suspect) → F_malware takes in a
set of regular expression signatures Σ', a set
of files F_suspect, and outputs the set of files
F_malware ⊆ F_suspect matching Σ'.


The crux of the SplitScreen algorithm can be ex- 
pressed as: 


SCAN(Σ, F) =
  let (φ', F_suspect) = FFBF-SCAN(FFBF-INIT(Σ), F) in
  FFBF-VERIFY(FFBF-HIT(φ', Σ), F_suspect)
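
The composition can be written out directly. In the sketch below only `scan` mirrors the expression above; the four `toy_*` functions are hypothetical stand-ins in which a set of 2-byte signature fragments plays the role of the bit vectors, so that the data flow can be run end to end without a real FFBF.

```python
def scan(signatures, files, ffbf_init, ffbf_scan, ffbf_hit, ffbf_verify):
    """Compose the four steps in the order given above."""
    phi = ffbf_init(signatures)                  # all-patterns bit vector
    phi_prime, suspects = ffbf_scan(phi, files)  # matched-patterns vector, suspect files
    needed = ffbf_hit(phi_prime, signatures)     # signatures required for verification
    return ffbf_verify(needed, suspects)         # exact matching on suspect files only


# Toy stand-ins (hypothetical): fragment sets instead of Bloom filter bit vectors.
def toy_init(signatures):
    return {s[:2] for s in signatures}


def toy_scan(phi, files):
    seen, suspects = set(), {}
    for name, data in files.items():
        frags = {data[i:i + 2] for i in range(len(data) - 1)} & phi
        if frags:
            seen |= frags
            suspects[name] = data
    return seen, suspects


def toy_hit(phi_prime, signatures):
    return {s for s in signatures if s[:2] in phi_prime}


def toy_verify(needed, suspects):
    return {name for name, data in suspects.items() if any(s in data for s in needed)}
```

In this toy setting a file that contains only the first two bytes of a signature becomes a suspect but is rejected by the verification step, which is the role the second pass plays in the real system.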


Let R denote the existing regular expression pattern 
matching algorithm, e.g., R is ClamAV. SplitScreen 
achieves the following properties: 


Correctness. SCAN will return the same set of files as 
identified by R, i.e., SCAN(Σ, F) = R(Σ, F).

Higher Throughput. SCAN runs faster than R. In par- 
ticular, we want the time for FFBF-SCAN plus 
FFBF-VERIFY plus FFBF-HIT to be less than the 
time to execute R. (Since FFBF-INIT is an initial- 
ization step performed only once per set of signa- 
tures, we do not consider it for throughput. We sim- 
ilarly discount in R the time to initialize any data 
structures in its algorithm.) 
Less Memory. The amount of memory needed by
SCAN is less than that needed by R. In particular, we want
max(|φ| + |φ'|, |Σ'|) < |Σ| (the bit vectors are not
required to be in memory during FFBF-VERIFY).
We expect that the common case is that most signatures
are never matched, e.g., the average user
does not have hundreds of thousands or millions of
unique malware programs on their computer. Thus
|Σ'| < |Σ|, so the total memory overhead will be
significantly smaller. In the worst case, where every
signature is matched, Σ' = Σ and SplitScreen’s
memory overhead is the same as that of existing systems.

Scales to More Signatures. Since the all-patterns bit 
vector φ takes a fraction of the space needed by
typical exact pattern matching data structures, the 
system scales to a larger number of signatures. 

Network-based System. Our approach naturally leads
to a distributed implementation where we keep the
full set of signatures Σ on a server, and distribute
φ to clients. Clients use φ to construct an FFBF
and scan their files locally. After FFBF-SCAN returns,
the client sends φ' to a server to perform
FFBF-HIT, and gets back the set of signatures Σ' actually
needed to confirm malware is present. The
client runs FFBF-VERIFY locally.

Privacy. In previous network-based approaches such as 
CloudAV [19], a client sends every file to a server 
(the cloud) for scanning. Thus, the server can see all 
of the client’s files. In our setting, the client never 
sends a file across the network. Instead, the client 
sends φ', which can be thought of as a list of possible
viruses on their system. We believe this is a
better privacy tradeoff. Furthermore, clients can attain
deniability as explained in §3.4. Note our architecture
can be used to realize the existing anti-
malware paradigm where the client simply asks for 
all signatures. Such a client would still retain the 
improved throughput during scanning by using our 
FFBF-based algorithms. 





3.3 Bloom-Based Building Blocks


Bloom filters can have false positives, so a hit must be 
confirmed by an exact pattern matching algorithm (hence 
the need for FFBF-VERIFY). Our first Bloom filter en- 
hancement reduces the number of signatures needed for 
verification, while the second accelerates the Bloom fil- 
ter scan itself. 


3.3.1 Feed-Forward Bloom Filters 


An FFBF consists of two bit vectors. The all-patterns 
bit vector is a standard Bloom filter initialized as described
in §3.5.1. In our setting, the set of items is Σ.
The matched-patterns bit vector is initialized to 0. 

As with an ordinary Bloom filter, a candidate item is
hashed and the corresponding bits are tested against the
all-patterns bit vector. If all the hashed bits are set in
the all-patterns bit vector, the item is output as an FFBF
match. When a match occurs, the FFBF will additionally
set each bit used to check the all-patterns bit vector to 1 in
the matched-patterns bit vector. In essence, the
matched-patterns bit vector records which entries were found in
the Bloom filter. This process is shown in Figure 3.

Figure 3: Building the matched-patterns bit vector as
part of the feed-forward Bloom filter algorithm.

After all input items have been scanned through the 
FFBF, the matched-patterns bit vector is a Bloom fil- 
ter representing the patterns that were matched. The 
user of an FFBF can generate a list of potentially match- 
ing patterns by running the input pattern set against the 
matched-patterns Bloom filter to identify which items 
were actually tested. Like any other Bloom filter output, 
the output pattern subset may contain false positives. 

In SplitScreen, φ is the all-patterns bit vector, and φ′ is
the matched-patterns bit vector created by FFBF-SCAN.
Thus, φ′ identifies (a superset of) signatures that would
have matched using exact pattern matching. FFBF-HIT
uses φ′ to determine the set of signatures needed for
FFBF-VERIFY.
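
The two-vector mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the SplitScreen implementation: the class and method names are hypothetical, the bit vectors are stored one byte per bit for readability, and the hash positions are derived from SHA-256 purely for convenience (the hash functions SplitScreen actually uses are described in §3.5.1).

```python
import hashlib


class FeedForwardBloomFilter:
    """Toy FFBF: an all-patterns vector plus a matched-patterns vector."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.all_patterns = bytearray(num_bits)      # one byte per bit, for clarity
        self.matched_patterns = bytearray(num_bits)  # starts as all zeros

    def _positions(self, item):
        # Assumption: two base hashes cut from a SHA-256 digest, combined as
        # h1 + i*h2 to obtain k bit positions.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, pattern):
        """Insert a signature fragment into the all-patterns vector."""
        for pos in self._positions(pattern):
            self.all_patterns[pos] = 1

    def test_and_record(self, candidate):
        """Test a candidate; on a hit, feed its bits forward (scan step)."""
        positions = self._positions(candidate)
        if all(self.all_patterns[p] for p in positions):
            for p in positions:
                self.matched_patterns[p] = 1
            return True
        return False

    def possibly_matched(self, patterns):
        """Return the patterns whose bits are all set in the matched vector."""
        return [p for p in patterns
                if all(self.matched_patterns[pos] for pos in self._positions(p))]
```

In this sketch, `test_and_record` plays the role of the per-window test inside FFBF-SCAN, and `possibly_matched` plays the role of FFBF-HIT: the pattern set is run against the matched-patterns vector, so the result is a superset of the patterns that really occurred.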


3.3.2 Cache-Partitioned Bloom Filters 


While a Bloom filter alone is more compact than other 
data structures traditionally used in pattern matching al- 
gorithms like Aho-Corasick or Wu-Manber, it is not oth- 
erwise more cache-friendly: it performs random access 
within a large vector. If this vector does not fit entirely 
in cache, the accesses will cause cache misses which will 
degrade performance substantially. 

SplitScreen uses our cache-friendly partitioned Bloom
filter design [18], which splits the input bit vector into
two parts. The first is sized to be entirely cache-resident,
and the first s hash functions map only into this section of
the vector. The second is created using virtual memory
super-pages (when available) and is sized to be as large
as possible without causing TLB misses. The FFBF prevents
cache pollution by using non-cached reads into the
second Bloom filter. The mechanisms for automatically
determining the size of these partitions and the number of
hash functions are described in our technical report [18].

The key to this design is that it is optimized for Bloom
filter misses. Recall that a Bloom filter hit requires
matching each hash function against a “1” in the bit 
vector. As a result, most misses will be detected after 
the first or second test, with an exponentially decreasing 
chance of requiring more and more tests. 

The combination of a Bloom filter representation and
a cache-friendly implementation provides a substantial
speedup on modern architectures, as we show in §5.
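
The early-exit argument can be made concrete with a small calculation. The sketch below assumes that each probed bit is set independently with the same probability (the fill fraction of the vector); in the partitioned design the two parts may have different fill fractions, so this is only a first-order model, and the function names are ours.

```python
def prob_needing_more_than(tests, fill_fraction):
    """Probability that a non-member item survives the first `tests` bit tests."""
    return fill_fraction ** tests


def expected_probes(fill_fraction, num_hashes):
    """Expected number of bit tests for a non-member item.

    Testing stops at the first 0 bit, so test j+1 is reached only if the
    first j tests all saw a 1.
    """
    return sum(fill_fraction ** j for j in range(num_hashes))
```

For example, with half of the bits set and four hash functions, `expected_probes(0.5, 4)` evaluates to 1.875: most misses are rejected by the first or second test, which is why keeping the first tests cache-resident pays off.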


3.4 SplitScreen Distributed Anti-Malware 


In the SplitScreen distributed model, the input files are 
located on the clients, while the signatures are located on 
a server. The system works as follows: 


1. The server generates the all-patterns bit vector for 
the most recent malware signatures and transmits it 
to the client. It will be periodically updated to con- 
tain the latest malware bit patterns, just as existing 
approaches must be updated. 

2. The client performs the pre-screening phase us- 
ing the feed-forward Bloom filter, generates the 
matched-patterns bit vector, compresses it and 
transmits it to the server. 

3. The server uses the matched-patterns bit vector to 
filter the signatures database and sends the full def- 
initions (1% of the signatures) to the client. 

4. The client performs exact matching with the suspect 
files from the first phase and the suspect signatures 
received from the server. 


In this system, SplitScreen clients maintain only the
all-patterns bit vectors φ (there will be two bit vectors
corresponding to two FFBFs, one for each type of signature).
Instead of replicating the large signature database
at each host, the database is stored only at the server and
clients only get the signatures they are likely to need.
This makes updates inexpensive: the server updates its
local signature database and then sends differential
all-patterns bit vector updates⁴ to the clients.

Since the clients don’t have to use the entire set of sig- 
natures for scanning, they also need less in-core memory 
(important for multi-task systems), and have smaller load 
times. 

SplitScreen does not expose as much private data as
earlier distributed anti-malware systems [19], because
the contents of clients’ files are never sent over the
network. Instead, clients only send compact representations
(bit vectors) of short hashes (under 32 bits) of small
(usually under 20 bytes long) parts of undisclosed files and
hashes of MD5 signatures of files. Clients concerned
about deniability could set additional (randomly chosen)
bits in their matched-patterns bit vectors in exchange for
increased network traffic.

⁴An all-patterns bit vector update is a sparse (and so highly
compressible) bit vector that is overlaid on top of the old bit vector.

Figure 4: Data flow for distributed SplitScreen.


3.5 Design Details 


3.5.1 Files and Signatures Screening 


As explained in §2.1, ClamAV uses two types of signatures:
regexp signatures and MD5 signatures. We handle
each with its own FFBF.


Pattern signatures. The SplitScreen server extracts a 
fragment of length w from every signature (the way w
is chosen is discussed in §5.8, while handling signatures
smaller than w bytes and signatures containing wildcards
is presented in §3.5.3 and §3.5.2). These fragments will
be hashed and inserted into the FFBF. When performing 
FFBF scanning, a window of the same size (w) is slid 
through the examined files, and its content at every po- 
sition is tested against the filter. The hash functions we 
use in our FFBF implementation are based on hashing 
by cyclic polynomials [8] which we found to be effective 
and relatively inexpensive. To reduce computation fur- 
ther, we use the idea of Kirsch and Mitzenmacher [12] 
and compute only two independent hash functions, de- 
riving all the others as linear combinations of the first 
two. 
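
The sliding-window hashing can be illustrated as follows. This is a sketch under stated assumptions: it uses one common form of cyclic polynomial hashing (rotate-and-XOR over a table of random 32-bit words, often called buzhash), the table seeds and all function names are hypothetical, and the way the two base hashes are combined follows the Kirsch and Mitzenmacher construction g_i = h1 + i*h2 (mod m).

```python
import random

WORD = 32
MASK = (1 << WORD) - 1


def rotl(x, r):
    """Rotate a 32-bit value left by r bits."""
    r %= WORD
    return ((x << r) | (x >> (WORD - r))) & MASK


def make_table(seed):
    """One random 32-bit word per byte value (assumed substitution table)."""
    rng = random.Random(seed)
    return [rng.getrandbits(WORD) for _ in range(256)]


def direct_hash(table, window):
    """Cyclic polynomial hash of a whole window, computed from scratch."""
    h = 0
    for byte in window:
        h = rotl(h, 1) ^ table[byte]
    return h


def rolling_hashes(table, data, w):
    """Yield (offset, hash) for every w-byte window, in O(1) per step."""
    if len(data) < w:
        return
    h = direct_hash(table, data[:w])
    yield 0, h
    for i in range(w, len(data)):
        # The outgoing byte has been rotated w times; cancel it, then
        # mix in the incoming byte.
        h = rotl(h, 1) ^ rotl(table[data[i - w]], w) ^ table[data[i]]
        yield i - w + 1, h


def bit_positions(h1, h2, k, m):
    """Derive k filter positions from two base hashes: (h1 + i*h2) mod m."""
    return [(h1 + i * h2) % m for i in range(k)]
```

Using two tables built from different seeds gives the two base hashes for each window; every further hash function costs only one multiply-add.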


MD5 signatures. ClamAV computes the MD5 hash of
each scanned file (or its sections) and searches for it in a
hash table of malware MD5 signatures. SplitScreen replaces
the hash table with an FFBF to save memory. The
elements inserted into the filter are the MD5 signatures
themselves, while the candidate elements tested against
the filter are the MD5 hashes computed for the scanned
files. Since the MD5 signatures are uniform hash values,
the hash functions used for the FFBF are straightforward:
given a 16-byte MD5 signature $b_1 b_2 \ldots b_{16}$, we
compute the 4-byte hash values as linear combinations of
$h_1 = b_1 \ldots b_4 \oplus b_5 \ldots b_8$ and $h_2 = b_9 \ldots b_{12} \oplus b_{13} \ldots b_{16}$.
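
As a small illustration of this construction, the sketch below folds a 16-byte MD5 digest into two 32-bit base hashes by XORing adjacent 4-byte words and then forms linear combinations of them. The big-endian byte order, the function names, and the particular combination (h1 + i*h2) mod m are assumptions made for the example.

```python
import hashlib


def md5_base_hashes(md5_digest):
    """Fold a 16-byte digest into two 32-bit hashes by XORing 4-byte words."""
    if len(md5_digest) != 16:
        raise ValueError("expected a 16-byte MD5 digest")
    words = [int.from_bytes(md5_digest[i:i + 4], "big") for i in range(0, 16, 4)]
    h1 = words[0] ^ words[1]   # b1..b4  XOR b5..b8
    h2 = words[2] ^ words[3]   # b9..b12 XOR b13..b16
    return h1, h2


def md5_bit_positions(md5_digest, k, m):
    """k filter positions as linear combinations of the two base hashes."""
    h1, h2 = md5_base_hashes(md5_digest)
    return [(h1 + i * h2) % m for i in range(k)]


# Example: positions for the MD5 of an (empty) file in a 2^20-bit filter.
positions = md5_bit_positions(hashlib.md5(b"").digest(), k=4, m=1 << 20)
```

No additional hashing of the file is needed: the MD5 digest that the scanner already computes supplies all the randomness.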


3.5.2 Signatures with Wildcards 


A small fraction (1.5% in ClamAV) of regular expression
signatures contain wildcards, but SplitScreen’s
Rabin-Karp-based FFBF algorithm operates with fixed strings.
Simply expanding the regular expressions does not work.
For example, the expression


3c666f726d3e{1-200}3c696e707574


(where “{1-200}” matches any sequence no longer than
200 bytes) generates $256^{200}$ different byte sequences. It
is impractical to put all of them into the Bloom filter.

Instead, SplitScreen extracts the invariant fragments 
(fixed byte subsequences) of a wildcard-containing sig- 
nature and selects one of these fragments to put in the 
FFBF (see §3.5.4 for more details about fragment selection).
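
Extracting the invariant fragments can be sketched as follows. The parser below is deliberately simplified and is not ClamAV’s: it assumes a hex-encoded signature in which `{...}` ranges, `*`, parenthesized alternatives, and byte pairs containing `?` are the only wildcard forms, and the function names are hypothetical.

```python
import re

# Assumed wildcard forms: {n-m} style ranges, '*', and (aa|bb) alternatives.
_WILDCARD = re.compile(r"\{[^}]*\}|\*|\([^)]*\)")


def invariant_fragments(hex_signature):
    """Split a hex signature into its fixed byte subsequences."""
    fragments = []
    for piece in _WILDCARD.split(hex_signature):
        current = bytearray()
        for i in range(0, len(piece) - 1, 2):
            pair = piece[i:i + 2]
            if "?" in pair:              # single-byte or nibble wildcard
                if current:
                    fragments.append(bytes(current))
                current = bytearray()
            else:
                current.append(int(pair, 16))
        if current:
            fragments.append(bytes(current))
    return fragments


def window_candidates(fragments, w):
    """All w-byte windows of the fragments that are at least w bytes long."""
    return [f[i:i + w] for f in fragments for i in range(len(f) - w + 1)]
```

For an illustrative signature such as `3c666f726d3e{1-200}3c696e707574`, this yields the two fragments `<form>` and `<input`. Fragments shorter than the window produce no candidates, which is the case handled by the short signature set in §3.5.3.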


3.5.3 Short Signatures


If a regular expression signature does not contain a fixed 
fragment at least as long as the window size, the signa- 
ture cannot be added to the feed-forward Bloom filter. 
Decreasing the window size to the length of the short- 
est signature in the database would raise the Bloom fil- 
ter scan false positive rate to an unacceptable level, be- 
cause the probability of a random sequence of bytes be- 
ing found in any given file increases exponentially as the 
sequence shortens. 

SplitScreen therefore performs a separate, exact pat- 
tern matching step for short signatures concurrently with 
the FFBF scanning. Short signatures are infrequent 
(they represent less than 0.4% of ClamAV’s signature set 
for our default choice for the window size—12 bytes), 
so this extra step does not significantly reduce perfor- 
mance. The SplitScreen server builds the short signa- 
ture set when constructing the Bloom filters. Whenever 
a SplitScreen client requires Bloom filter updates, the 
SplitScreen server sends it this short signature set too. 


3.5.4 Selecting Fragments using Document Frequency


While malware signatures are highly specific, the fixed- 
length substrings that SplitScreen uses may not be. For 
example, suppose that the window size is 16 bytes. Almost
every binary file contains 16 consecutive “0x00”
bytes. Since we want to keep as few files as possible for 
the exact-matching phase, we should be careful not to 
include such a pattern into the Bloom filter. 


inputs
    Σ = set of signatures
    σ = input signature (σ ∈ Σ)
    w = fixed window size
    γ = length w fixed byte sequence (w-gram) in σ
    DF(γ) = the document frequency of w-gram γ
outputs
    φ = FFBF signatures
    Σ_short = set of short signatures

for all σ ∈ Σ_MD5, put σ into φ_MD5
for all σ in Σ_fixed ∪ Σ_wild
    if |σ| ≥ w
        for all fixed byte w-grams γ in σ
            if DF(γ) = 0
                put γ into φ_regexp; GOTO next σ
    // either shorter than w or no zero DF
    put σ into Σ_short


Figure 5: Final FFBF-INIT algorithm.


We use the document frequency (DF) of signature 
fragments in clean binary files to determine if a signa- 
ture fragment is likely to match safe files. The DF of a 
signature fragment represents the number of documents 
containing this fragment. A high DF indicates that the 
corresponding signature fragment is common and may 
generate many false positives. 

We compute the DF value for each window-sized sig- 
nature fragment in clean binary samples. For each signa- 
ture, we insert into the filter the first fragment with a DF 
value of zero (i.e., the fragment did not occur in any of 
the clean binary files). The signatures that have no zero 
DF fragments are added to the short signature set. 

We summarize our signature processing algorithm in 
Figure 5. The SplitScreen server runs this algorithm 
for every signature, and creates two Bloom filters—one 
for MD5 signatures, and one for the regular expression 
signatures—as well as the set of short signatures. 
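
A runnable sketch of this selection procedure is given below. It is an illustration under stated assumptions rather than the SplitScreen code: signatures are represented as a mapping from a name to its list of fixed fragments, the outputs are plain Python sets and dictionaries instead of Bloom filters, and all function names are hypothetical.

```python
from collections import Counter


def build_document_frequency(clean_files, w):
    """Count, for each w-gram, the number of clean files that contain it."""
    df = Counter()
    for content in clean_files:
        grams = {content[i:i + w] for i in range(len(content) - w + 1)}
        df.update(grams)          # each file contributes at most 1 per w-gram
    return lambda gram: df[gram]  # Counter returns 0 for unseen w-grams


def ffbf_init(md5_sigs, regexp_sigs, w, document_frequency):
    """Pick one zero-DF w-gram per regexp signature; others become short."""
    md5_patterns = set(md5_sigs)
    regexp_patterns = {}
    short_signatures = []
    for name, fragments in regexp_sigs.items():
        chosen = None
        for frag in fragments:
            for i in range(len(frag) - w + 1):
                gram = frag[i:i + w]
                if document_frequency(gram) == 0:
                    chosen = gram      # first fragment unseen in clean files
                    break
            if chosen is not None:
                break
        if chosen is None:
            # Either shorter than w or every w-gram occurs in a clean file.
            short_signatures.append(name)
        else:
            regexp_patterns[name] = chosen
    return md5_patterns, regexp_patterns, short_signatures
```

In a full system the chosen w-grams and the MD5 digests would then be inserted into their respective FFBFs; the sketch stops at the selection step because that is the part governed by document frequency.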


3.5.5 Important Parameters 


We summarize in this section the important parameters 
that affect the performance of our system, focusing on 
the tradeoffs involved in choosing those parameters. 

Bit vector size. The size of the bit vectors trades scan 
speed for memory use. Larger bit vectors (specifically, 
larger non-cache-resident parts) result in fewer Bloom 
filter false positives, improving performance up to the 
point where TLB misses become a problem (see §3.3.2).

Sliding window size. The wider the sliding window
used to scan files during FFBF-SCAN, the less chance
there is of a false positive (see §5.8). This makes
FFBF-VERIFY run faster (because there will be fewer files to
check). However, the wider the sliding window, the more
signatures must be added to the short signature set.
Since we look for short signatures in every input file,
a large number of short signatures will reduce performance.


Number of Bloom filter hash functions. The number of 
hash functions used in the FFBF algorithm (the k param- 
eter in §2.2) is a parameter for which an optimum value 
can be computed when taking into account the character- 
istics of the targeted hardware (e.g. the size of the caches, 
the latencies in accessing different levels of the memory 
hierarchy) as described in [18]. Empirically, we found
that two hash functions each for the cache-resident part
and the non-cache-resident part of the FFBF work well
for a wide range of hardware systems.


4 Implementation 


We have implemented SplitScreen as an extension of
the ClamAV open source anti-malware platform, version
0.94.2. Our code is available at http://security.ece.cmu.edu.
The changes comprised approximately
8K lines of C code. The server application used in
our distributed anti-malware system required 5K lines of
code. SplitScreen servers and SplitScreen clients communicate
with each other via TCP network sockets.


The SplitScreen client works like a typical anti-malware
scanner; it takes in a set of files, a signature
database (φ in SplitScreen), and outputs which files are
malware along with any additional metadata such as the
malware name. We modified the existing Libclamav
library to have a two-phase scanning process using FFBFs.


The SplitScreen server generates φ from the default
ClamAV signatures using the algorithm shown in Fig- 
ure 5. Note that SplitScreen can implement traditional 
single-host anti-malware by simply running the client 
and server on the same host. We use run-length encoding 
to compress the bit vectors and signatures sent between 
client and server. 
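
To show why run-length encoding suits these sparse bit vectors, here is a minimal sketch. The (bit, run length) pair format and the function names are assumptions for illustration; the wire format used by SplitScreen is not specified here.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a sequence of 0/1 values as a list of (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit sequence."""
    out = []
    for bit, length in runs:
        out.extend([bit] * length)
    return out
```

A matched-patterns vector with only a handful of set bits collapses to a few long runs of zeros separated by short runs of ones, so the encoded size grows with the number of set bits rather than with the vector length.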


5 Evaluation 


In this section we first detail our experimental setup, 
and then briefly summarize the malware measurements 
that confirm our hypothesis that most of the volume 
of malware can be detected using a few signatures. 
We then present an overall performance comparison of 
SplitScreen and ClamAV, followed by detailed measure- 
ments to understand why SplitScreen performs well, how 
it scales with increasing numbers of regexp and MD5 
signatures, and how its memory use compares with Cla- 
mAV. We then evaluate SplitScreen’s performance on 
resource constrained devices and its performance in a 
network-based use model. 
5.1 Evaluation Setup


Unless otherwise specified, our experiments were conducted
on an Intel 2.4 GHz Core 2 Quad with 4 GB of
RAM and an 8 MB split L2 cache using a 12-byte window
size (see §3). When comparing SplitScreen against ClamAV,
we exclude data structure initialization time in ClamAV,
but count the time for FFBF-INIT in SplitScreen.
Thus, our measurements are conservative because they
reflect the best possible setting for ClamAV, and the
worst possible setting for SplitScreen. Unless otherwise
specified, we report the average over 10 runs.

Scanned files. Unless otherwise specified, all mea- 
surements reflect scanning 344 MB of 100% clean files. 
We use clean files because they are the common case, 
and exercise most code branches. (§5.7 shows performance
for varying amounts of malware.) The clean files
come from a fresh install of Microsoft Windows XP plus 
typical utilities such as MS Office 2007 and MS Visual 
Studio 2007. 

Signature sets. We use two sets of signatures for 
the evaluation. If unspecified, we focus on the current 
ClamAV signature set (main v.50 and daily v.9154 from 
March 2009), which contained 530K signatures. We use 
four additional historical snapshots from the ClamAV 
source code repository. To measure how SplitScreen
will improve as the number of signatures continues to
grow, we generated additional regex and MD5 signatures
(“projected” in our graphs) in the same relative proportion
as the March signature set. The synthetic regexes
were generated by randomly permuting fixed strings in
the March snapshot, while the synthetic MD5s are random
16-byte strings.


5.2 Malware Measurements 


Given a set of signatures Σ, we are interested in knowing
how many individual signatures Σ′ are matched in
typical scenarios, i.e., |Σ′| vs. |Σ|. We hypothesized
that most signatures are rarely matched (|Σ′| ≪ |Σ|), e.g.,
most signatures correspond to malware variants that are
never widely distributed.

One typical use of anti-malware products is to filter
out malware from email. We scanned Carnegie Mellon
University’s email service from May 1st to August 29th
of 2009 with ClamAV. 1,392,786 malware instances were
detected out of 19,443,381 total emails, thus about 7% of
all email contained malware by volume. The total number
of unique signatures matched was 1,825, which is
about 0.34% of the total signatures (see Figure 6).

Another typical use of anti-malware products is to
scan files on disk. We acquired 393 GB of malware from
various sites, removed duplicate files based upon MD5,
and removed files not recognized by ClamAV using the
v.9661 daily and v.51 main signature database. The total
number of signatures in ClamAV was 607,988, and the
total number of unique malware files was 960,766 (about
221 GB). ClamAV reported that the 960,766 unique
files contained 128,992 unique malware variants.
Thus, about 21.2% of signatures were matched.

Figure 6: The overall amount of malware detected
(y axis) vs. the total number of malware signatures
needed (x axis). For example, about 1000 signatures
are needed to detect virtually all malware.

We conclude that indeed most signatures correspond 
to rare malware, while only a few signatures are typi- 
cally needed to match malware found in day-to-day op- 
erations. 
Figure 7: Performance of SplitScreen and ClamAV
using historical and projected ClamAV signature sets.


5.3 SplitScreen Throughput


We ran SplitScreen using both historical and projected 
signature sets from ClamAV, and compared its perfor- 
mance to ClamAV on the same signature set. Figure 7 
shows our results. SplitScreen consistently improves
throughput by at least 2× on previous and existing signatures,
and the throughput improvement factor increases
with the number of signatures.


Understanding throughput: Cache misses. We hypothesized
that a primary bottleneck in ClamAV was
L2 cache misses in regular expression matching. Figure 8
shows ClamAV’s throughput and memory use as
the number of regular expression signatures grows from
zero to roughly 125,000, with no MD5 signatures. In
contrast, increasing the number of MD5 signatures linearly
increases the total memory required by ClamAV,
but has almost no effect on its throughput. With no regexp
signatures, ClamAV scanned nearly 50 MB/sec,
regardless of the number of MD5 signatures.

Figure 8: ClamAV scanning throughput and memory
consumption as the number of regular expression
signatures increases.


Figure 9 compares the absolute number of L2 cache
misses for ClamAV and SplitScreen as the (total) number
of signatures increases. The dramatic increase in L2
cache misses for ClamAV suggests that this is, indeed,
a major source of its performance degradation. In contrast,
the number of cache misses for SplitScreen is much
lower, helping to explain its improved scanning performance.
These results indicate that increasing the number
of regex signatures increases the number of cache misses
and decreases throughput, making regular expression
matching the primary throughput bottleneck in ClamAV.


5.4 SplitScreen Scalability and Performance Breakdown


How well does SplitScreen scale? We measured three 
scaling dimensions: 1) how throughput is affected as the 
number of regular expression signatures grows, 2) how 
FFBF size affects performance and memory use, and 3) 
where SplitScreen spends time as the number of signa- 
tures increases. 

Throughput. Figure 10 shows SplitScreen’s throughput
as the number of signatures grows from 500K
(approximately what is in ClamAV now) to 3 million.
At 500K signatures, SplitScreen performs about 2.25
times better than ClamAV. At 3 million signatures,
SplitScreen performs 4.5 times better. The 4.5×
throughput increase is given with a 32 MB FFBF. These
measurements are all an average over 10 runs. The worst
of these runs is the first when the file system cache is
cold, when SplitScreen was only 3× faster than ClamAV
(graph omitted due to space).

FFBF Size. We also experimented with smaller
FFBFs of size 8, 12, 20, and 36 MB, as shown in Figure 10.
The larger the FFBF, the smaller the false positive
ratio, thus the greater the performance. We saw no additional
performance gain by increasing the FFBF beyond
36 MB.


# sigs   FFBF-SCAN        FFBF-HIT      FFBF-VERIFY
         + Short Sigs.    + Traffic
500K     27.2 (94.7%)     0.7 (2.6%)    0.8 (2.7%)
1M       27.4 (92.4%)     0.9 (3.0%)    1.4 (4.6%)
2M       26.5 (76.0%)     1.3 (3.7%)    7.1 (20.3%)
3M       24.2 (58.3%)     1.7 (4.1%)    15.6 (37.6%)


Table 1: Time spent per step by SplitScreen to scan
1.55 GB of files (in seconds and by percentage).


Per-Step Breakdown. Table 1 shows the breakdown of
time spent per phase. We do not show FFBF-INIT which 
was always < 0.01% of total time. As noted earlier, we 
omit ClamAV initialization time in order to provide con- 
servative comparisons. 

We draw several conclusions from our experiments.
First, SplitScreen’s performance advantage continues
to grow as the number of regexp signatures increases.
Second, the time required by the first phase
of scanning in SplitScreen holds steady, but the exact
matching phase begins to take more and more time. This
occurs because we held the size of the FFBF constant.
When we pack more signatures into the same size FFBF,
the bit vector becomes more densely populated, thus
increasing the probability of a false positive due to hash
collisions. Such false positives result in more signatures
to check during FFBF-VERIFY. Thus, while the overall
scan time is relatively small, increasing the SplitScreen
FFBF size will help in the future, i.e., we can take
advantage of the larger caches the future may bring. Note
that the size increases to the FFBF need be nowhere
near as large as with ClamAV, e.g., a few megabytes for
SplitScreen vs. a few hundred megabytes for ClamAV.

[...] forward Bloom filters, keeping the cache-resident
portion constant.
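
The effect of packing more signatures into a fixed-size filter can be estimated with the textbook Bloom filter false positive approximation, (1 - e^(-kn/m))^k. The sketch below treats the FFBF as a single standard Bloom filter with one inserted fragment per signature and k = 4 hash functions; it ignores the cache partitioning, so the numbers are indicative only and are not measurements from our system.

```python
import math


def bloom_false_positive_rate(num_items, num_bits, num_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter."""
    fill = 1.0 - math.exp(-num_hashes * num_items / num_bits)
    return fill ** num_hashes


# Assumed configuration: a 32 MB bit vector and four hash functions.
FILTER_BITS = 32 * 1024 * 1024 * 8
rates = {n: bloom_false_positive_rate(n, FILTER_BITS, 4)
         for n in (500_000, 1_000_000, 2_000_000, 3_000_000)}
```

Under these assumptions the per-probe false positive probability rises steeply with the number of signatures while the filter size stays fixed, which is consistent with the growing FFBF-VERIFY share in Table 1.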


5.5 SplitScreen on Constrained Devices 


Figure 11 compares the memory required by SplitScreen 
and ClamAV for FFBF-SCAN. 533,183 signatures 
in ClamAV consumed about 116 MB of memory. 
SplitScreen requires only 55.4 MB, of which 40 MB are 
dedicated to FFBFs. Our FFBF was designed to minimize
false positives due to hash collisions but not adversely
affect performance due to TLB misses (§3.3.2).
At 3 million signatures, ClamAV consumed over 500 MB
of memory, while SplitScreen still performed well with a
40 MB FFBF.

We then tested SplitScreen’s performance with four
increasingly limited systems. We compare
SplitScreen and ClamAV using the current signature set
on: a 2009 desktop computer (Intel 2.4 GHz Core 2
Quad, 4 GB RAM, 8 MB L2 cache); a 2008 Apple laptop
(Intel 2.4 GHz Core 2 Duo, 2 GB RAM, 3 MB L2
cache); a 2005 desktop (Intel Pentium D 2.8 GHz, 4 GB
RAM, 2 MB L2 cache); and an Alix3c2 (AMD Geode
500 MHz, 256 MB RAM, 128 KB L2 cache) that we use
as a proxy for mobile/handheld devices.⁵

Figure 12 shows these results. On the desktop sys- 
tems and laptop, SplitScreen performs roughly 2x better 
than ClamAV. On the embedded system, SplitScreen per- 
forms 30% better than the baseline ClamAV. The modest 
performance gain was a result of the very small L2 cache 
on the embedded system. 

However, our experiments indicate a more fundamental
limitation with ClamAV on the memory-constrained
AMD Geode. When we ran using the 2 million signature
dataset, ClamAV exhausted the available system memory
and crashed. In contrast, SplitScreen successfully operated
using even the 3 million signature dataset. These
results suggest that SplitScreen is a more effective
architecture for memory-constrained devices.

⁵The AMD Geode has hardware capabilities similar to the iPhone
3GS, which has a 600 MHz ARM processor with 128 MB of RAM.

Figure 11: Memory use of SplitScreen and ClamAV.


5.6 SplitScreen Network Performance 


In the network-based setting there are three data transfers
between server and client: 1) the initial bit vector
φ (the all-patterns bit vector) generated by FFBF-INIT
sent from the server to the client; 2) the bit vector φ′ (the
matched-patterns bit vector) for signatures matched by
FFBF-SCAN sent by the client to the server; and 3) the
set of signatures Σ′ needed for FFBF-VERIFY sent by
the server to the client.

Recall that SplitScreen compresses the (likely-sparse)
bit vectors before transmission. The compressed size of
φ′ depends upon the signatures matched and the FFBF
false positive rate. Table 2 shows the network traffic and
false-positive rates in different cases. The size of both
φ′ and Σ′ remains small for these files, requiring significantly
less network traffic than transferring the entire
signature set.

Table 3 shows the size of the all-patterns bit vector φ,
which must be transmitted periodically to clients, for
increasing (gzipped) ClamAV database sizes. SplitScreen
requires about 10% of the network bandwidth to distribute
the initial signatures to clients.

Overall, the volume of network traffic for SplitScreen
(|φ| + |φ′| + |Σ′|) is between 10% and 13% of that used
by ClamAV on a fresh scan. On subsequent scans
SplitScreen will go out and fetch new φ′ and Σ′ if new
signatures are matched (e.g., the φ′ of a new scan has
different bits set than previous scans). However, since
|Σ′| ≪ |Σ|, the total lifetime traffic is still expected to be
very small.








5.7 Malware Scanning 


How does the amount of malware affect scan throughput?
We created a 100 MB corpus using different ratios
of malware and clean PE files. Figure 13 shows that
SplitScreen’s performance advantage slowly decreases
as the percentage of malware increases, because it must
re-scan a larger amount of the input files using the exact
signatures.

Figure 12: Performance for four different systems
(differing CPU, cache, and memory size).

Figure 13: Throughput as % of malware increases
(using total scan time including verification).


5.8 Additional SplitScreen Parameters 


In addition to the FFBF size (§5.4), we measured the effect
of different hash window sizes and the effectiveness
of using document frequency to select good tokens for
regular expression signatures.


Fixed string selection and document frequency. The
better the fixed string selection, the lower the false positive
rate will be, and thus the better SplitScreen performs.
We use the document frequency (DF) of known good
programs to eliminate fixed strings that would cause false
positives. Our experiments were conducted using the
known clean binaries as described in §5.1. We found the
performance increase in Figure 13 was in part due to DF
removing substrings that match clean files. We did a subsequent
test with 344 MB of PE files from our data set.
Without document frequency, we had a 22% false positive
rate and a throughput of 10 MB/s. With document
frequency, we had a 0.9% false positive rate and 12 MB/s
throughput. We also performed 10-fold cross validation
to confirm that document frequency is beneficial, with
the average and max false positive rate per window size
shown in Table 4.


Target file type     Size of       Number of     |φ′|      |Σ′|      Total traffic  False-positive
                     target files  target files  (bytes)   (bytes)   (bytes)        rate
Randomly generated   200 MB        1,000         80        405       485            0.50%
Randomly generated   2 GB          10,000        224       223       447            0.14%
Clean PE files       340 MB        1,957         1,829     15,082    16,911         4.19%
Clean ELF files      157 MB        1,319         180       11,766    13,338         9.26%
100% Malware         170 MB        534           17,100    160,828   177,928        N/A
100% Malware         1.1 GB        D277          61,748    648,962   710,710        N/A

Table 2: Network traffic for SplitScreen using 530K signatures.


# signatures   ClamAV CVD (MB)   FFBF + Short Sigs (MB)
130K           9.9               0.77
[...]
530K           20.8              2.0

Table 3: Signature size initially sent to clients.


Window Size   Avg. F-P   Max. F-P   # Short Sigs
8 bytes       17.3       18.9       1169
10 bytes      11.6       14.3       1350
12 bytes      8.56       9.36       1624
14 bytes      6.70       [...]      2004
16 bytes      5.23       6.31       3203

Table 4: False positive rates for different window sizes.
The average and maximum FP rates are from the 10-fold
cross validation of DF on 1.55 GB of clean binaries.


Window size. A shorter hash window results in fewer 
short regexp signatures, but increases the false positive 
rate. The window represents the number of bytes from 
each signature used for FFBF scanning. For example, a 
window of 1 byte would mean a file would only have to 
match 1 byte of a signature during FFBF-SCAN. (The
system ensures correctness via FFBF-VERIFY.) 

Using an eight-byte window, hash collisions caused
3.98% of files to be mis-identified as malware in
FFBF-SCAN that later had to be weeded out during
FFBF-VERIFY. With a sixteen-byte window, the false positive
rate was only 0.46%. The throughput for an 8 and
16 byte window was 9.44 MB/s and 8.67 MB/s, respectively.
Our results indicate a window size of 12 bytes seems
optimal as a balance between the short signature set size,
the false positive rate, and the scan rate.


5.9 Comparison with HashAV 


The work most closely related to ours is HashAV [10]. 
HashAV uses Bloom filters as a first pass to reduce the 
number of files scanned by the regular expression algo- 
rithms. Although there are many significant differences 
between SplitScreen and HashAV (see §7), HashAV
serves as a good reference for the difference between a 
typical Bloom scan and our FFBF-based techniques. 

To enable a direct comparison, we made several modifications
to each system. We modified SplitScreen to
ignore file types and perform only the raw scanning supported
by HashAV. We disabled MD5 signature computation
and scanning in SplitScreen to match HashAV’s
behavior. We updated HashAV to scan multiple files instead
of only one. Finally, we changed the evaluation
to include only the file types that HashAV supported.
It is important to note that the numbers in this section
are not directly comparable to those in previous sections.
HashAV did not support the complex regexp patterns that
most frequently show up in SplitScreen’s small signatures
set, so the performance improvement of SplitScreen
over ClamAV appears larger in this evaluation than it does
in previous sections.


Figure 14 shows that with l1O00OK signatures, 
SplitScreen performs about 9x better than HashAV, 
which in turn outperforms ClamAV by a factor of two. 
SplitScreen’s performance does not degrade with an 
increasing number of signatures, while HashAV’s per- 
formance does. One reason is SplitScreen is more cache 
friendly; with large signature sets HashAV’s default 
Bloom filter does not fit in cache, and the resulting cache 
misses significantly degrade performance. If HashAV 
decreased the size of their filter, then there would be 
many false positives due to hash collisions. Further, 
HashAV does not perform verification using the small 
signature set as done by SplitScreen. As a result, the 
data structure for exact pattern matching during HashAV 
verification will be much larger than during verification 
with SplitScreen. 
Figure 14: HashAV and SplitScreen scan throughput.


6 Discussion 


We see the SplitScreen distributed model providing ben- 
efits in several scenarios, beyond the basic speedup pro-
vided by our approach. As shown in §5.6, a SplitScreen
client requires 10x less data than a ClamAV client be- 
fore it can start detecting malware. Furthermore, sending 
a new signature takes 8 bytes for SplitScreen (remem-
ber from §3.5.1 that all the FFBF bits corresponding to a
signature are generated from just two independent 32-bit 
hashes) and 20 to 350 bytes on ClamAV. These factors 
make SplitScreen more effective in responding to new 
malware because there is less pressure on update servers, 
and clients get updates faster. The other advantage to 
dynamically downloading signatures is that SplitScreen 
can be installed on devices with limited storage space, 
like residential gateways or mobile devices. 

In the SplitScreen distributed anti-malware model, the 
server plays an active role in the scanning process: it ex- 
tracts relevant signatures from the signature database for 
every scan that generates suspect files on a client. Run- 
ning on an Intel 2.4 GHz Core 2 Quad machine, the un- 
optimized server can sustain up to 14 requests per second 
(note that every request corresponded to a scan of 1.5 GB 
of binary files, so the numbers of suspect files and signa- 
tures were relatively high). As such, a single server can 
handle the virus scanning load of a set of clients scan- 
ning 21 GB/sec of data. While this suffices for a proof- 
of-concept, we believe there is substantial room to opti- 
mize the server’s performance in future work: (1) Clients 
can cache signatures from the server by adding them to 
their short signatures set; (2) the server can use an in- 
dexing mechanism to more rapidly retrieve the neces- 
sary signatures based upon the bits set in the matched- 
patterns bit vector; (3) conventional or, perhaps, peer-to- 
peer replication techniques can be easily used to replicate 
the server, whose current implementation is CPU inten- 
sive but does not require particularly large amounts of 
disk or memory. These improvements are complemen-
tary to our core problem of efficient malware scanning,
and we leave them as future work.


7 Related Work 


CloudAV [19] applies cloud computing to anti-virus 
scanning. It exploits ‘N-version protection’ to detect 
malware in the cloud network with higher accuracy. Its 
scope is limited, however, to controlled environments 
such as enterprises and schools to avoid dealing with 
privacy. Each client in CloudAV sends files to a cen- 
tral server for analysis, while in SplitScreen, clients send 
only their matched-patterns bit vector. 

Pattern matching, including using Bloom filters, has 
been extensively studied in and outside of the malware 
detection context. Several efforts have targeted net- 
work intrusion detection systems such as Snort, which 
must operate at extremely high speed, but that have a 
smaller and simpler signature set [21]. Bloom filters are 
a commonly-proposed technique for hardware acceler- 
ated deep packet inspection [9]. 

HashAV proposed using Bloom filters to speed up 
the Wu-Manber implementation used in ClamAV [10]. 
They show the importance of taking into account the 
CPU caches when designing exact pattern matching al-
gorithms. However, their system does not address all as-
pects of an anti-malware solution, including MD5 sig-
natures, signatures shorter than the window size, cache-
friendly Bloom filters when the data size exceeds cache
size, and reducing the number of signatures in the sub- 
sequent verification step. Furthermore, the SplitScreen 
FFBF-based approach scales much better for increases 
in the number of signatures. 

A solution for signature-based malware detection in 
resource constrained mobile devices had previously been 
presented in [22]. Similarly to SplitScreen, it used sig- 
nature fragment selection to accelerate the scanning, but 
could only handle fixed byte signatures, and was less 
memory efficient than SplitScreen. 

The “Oyster” ClamAV extensions [17] replaced Cla-
mAV’s Aho-Corasick trie with a multi-level trie to im-
prove its scalability, improving throughput, but did not
change its fundamental cache performance or reduce the 
number of signatures that files must be scanned against. 


8 Conclusion


SplitScreen’s two-phase scanning enables fast and 
memory-efficient malware detection that can be decom- 
posed into a client/server process that reduces the amount 
of storage on, and communication to, clients by an or- 
der of magnitude. The key aspects that make this de- 
sign work are the observation that most malware signa- 
tures are never matched—but must still be detectable— 
combined with the feed-forward Bloom filter that re-
duces the problem of malware detection to scanning a
much smaller set of files against a much smaller set of 
signatures. Our evaluation of SplitScreen, implemented 
as an extension of ClamAV, shows that it improves scan- 
ning throughput using today’s signature sets by over 2x, 
using half the memory. The speedup and memory sav- 
ings of SplitScreen improve further as the number of sig- 
natures increases. Finally, the efficient distributed execu- 
tion made possible using SplitScreen holds the potential 
to enable scalable malware detection on a wide range of 
low-end consumer and handheld devices. 
Abstract


We present a novel network-level behavioral malware 
clustering system. We focus on analyzing the structural 
similarities among malicious HTTP traffic traces gener- 
ated by executing HTTP-based malware. Our work is 
motivated by the need to provide quality input to algo- 
rithms that automatically generate network signatures. 
Accordingly, we define similarity metrics among HTTP
traces and develop our system so that the resulting clus- 
ters can yield high-quality malware signatures. 

We implemented a proof-of-concept version of our 
network-level malware clustering system and performed 
experiments with more than 25,000 distinct malware 
samples. Results from our evaluation, which includes 
real-world deployment, confirm the effectiveness of the 
proposed clustering system and show that our approach 
can aid the process of automatically extracting net- 
work signatures for detecting HTTP traffic generated by 
malware-compromised machines. 


1 Introduction 


The battle against malicious software (a.k.a. malware) 
is becoming more difficult. Today’s malware writers 
commonly use executable packing [16] and other code 
obfuscation techniques to generate a large number of 
polymorphic variants of the same malware. As a con- 
sequence, anti-viruses (AVs) have a hard time keeping 
their signature database up-to-date, and their AV scan- 
ners often have many false negatives [26]. 

Although it is easy to create many polymorphic vari- 
ants of a given malware sample, different variants of the 
same malware will exhibit similar malicious activities, 
when executed. Behavioral malware clustering groups 
malware variants according to similarities in their ma- 
licious behavior. This process is particularly useful be- 
cause once a number of different variants of the same 
malware have been identified and grouped together, it is 
easier to write a generic behavioral signature that can 


be used to detect future malware variants with low false 
positives and false negatives. 

Network-level signatures have some attractive proper- 
ties compared to system-level signatures. For example, 
enforcing system-level behavioral signatures often re- 
quires the use of virtualized environments and expensive 
dynamic analysis [21, 34]. On the other hand, network- 
level signatures are usually easier to deploy because we 
can take advantage of existing network monitoring in-
frastructures (e.g., intrusion detection systems and alert
monitoring tools), and monitor a large number of ma- 
chines without introducing overhead at the end hosts. 

The vast majority of malware needs a network con- 
nection in order to perpetrate their malicious activities 
(e.g., sending spam, exfiltrating private data, download- 
ing malware updates, etc.). In this paper, we focus on 
network-level behavioral clustering of HTTP-based mal-
ware, namely, malware that uses the HTTP protocol as
its main means of communicating with the attacker or
perpetrating its malicious intent.

HTTP-based malware is becoming more prevalent. 
For example, according to [20] the majority of spam 
botnets use HTTP to communicate with their command 
and control (C&C) server. Also, from our own mal- 
ware database, we found that among the malware sam- 
ples that show network activities, about 75% of them 
generate some HTTP traffic. In addition, there is evi- 
dence that Web-based “reusable” kits (or platforms) for 
remote command of malware, and in particular botnets, 
are available for sale on the Internet [14] (e.g., the C&C 
Web kit for Zeus bots can be currently purchased for 
about $700 [8]). 

Given a large dataset of malware samples and the ma- 
licious HTTP traffic they generate, our network-level be- 
havioral clustering system aims at unveiling similarities 
(or relationships) among malware samples that may not 
be captured by current system-level behavioral clustering 
systems [9, 10], thus offering a new point of view and 
valuable information to malware analysts. Unlike pre-
vious work on behavioral malware clustering, our work
is motivated by the need to provide quality input to al- 
gorithms that automatically generate network signatures. 
Accordingly, we define similarity metrics among HTTP 
traffic traces and develop our clustering system so that 
the resulting clusters can yield high quality malware sig- 
natures. Namely, after clustering is completed, the HTTP 
traffic generated by malware samples in the same cluster 
can be processed by an automatic signature generation 
tool, in order to extract network signatures that model the 
HTTP behavior of all the malware variants in that cluster. 
An Intrusion Detection System (IDS) located at the edge 
of a network can in turn deploy such network signatures 
to detect malware-related outbound HTTP requests. 

The main contributions of this paper are as follows: 

• We propose a novel network-level behavioral mal-
ware clustering system based on the analysis of
structural similarities among malicious HTTP traf-
fic traces generated by different malware samples.

• We introduce a new automated method for ana-
lyzing the results of behavioral malware clustering
based on a comparison with family names assigned
to the malware samples by multiple AVs.

• We show that the proposed system enables accurate
and efficient automatic generation of network-level
malware signatures, which can complement tradi-
tional AVs and other defense techniques.

• We implemented a proof-of-concept version of our
malware clustering system and performed experi-
ments with more than 25,000 malware samples. Re-
sults from our evaluation, which includes real-world
deployment, confirm the effectiveness of the pro-
posed clustering system.


2 Related Work 


System-level behavioral malware clustering has been 
recently studied in [9, 10]. In particular, Bayer et al. [10] 
proposed a scalable malware clustering algorithm based 
on malware behavior expressed in terms of detailed sys- 
tem events. However, the network information they use 
is limited to high-level features such as the names of 
downloaded files, the type of protocol, and the domain 
name of the server. Our work is different because we fo- 
cus on the malicious HTTP traffic traces generated by 
executing different malware samples. We extract de- 
tailed information from the network traces, such as the 
number and type of HTTP queries, the length and struc- 
tural similarities among URLs, the length of data sent 
and received from the HTTP server, etc. Compared with 
Bayer et al. [10], we do not consider the specific TCP 
port and domain names used by the malware. We aim 
to group together malware variants that may contact dif- 
ferent web servers (e.g., because they are controlled by 
a different attacker), and may or may not use an HTTP 
proxy (whereby the TCP port used may vary), but have
strong similarities in terms of the structure and sequence 
of the HTTP queries they perform (e.g., because they rely 
on the same C&C Web kit). Also, we develop our behav- 
ioral clustering algorithm so that the results can be used 
to automatically generate network signatures for detect- 
ing malicious network activities, as opposed to system- 
level signatures. 

Automatic generation of network signatures has been 
explored in various previous work [23, 24, 29, 32, 33]. 
Most of these studies focused mainly on worm finger- 
printing. Different approaches have been proposed to 
deal with generating signatures from a dataset of network 
flows related to the propagation of different worms. In 
particular, Polygraph [24] applies clustering techniques 
to try to separate worm flows belonging to different 
worms, before generating the signatures. However, Poly- 
graph’s clustering algorithm is greedy and becomes pro- 
hibitively expensive when dealing with the high number 
of malicious flows generated by a large dataset of differ- 
ent types of malware, as we will discuss in Section 6.2. 
Since behavioral malware clustering aims at efficiently 
clustering large datasets of different malware samples 
(including bots, adware, spyware, etc., besides worms),
the clustering approaches proposed for worm fingerprint- 
ing are not suitable for this task. Compared with [24] and 
other previous work on worm fingerprinting, we focus 
on clustering of different types of HTTP-based malware
(not only worms) in an efficient manner. 

BotMiner [15], an anomaly-based botnet detection 
system, applies clustering of network flows to detect 
the presence of bot-compromised machines within en- 
terprise networks. BotMiner uses high-level statistics for 
clustering network flows, and is limited to detecting bot- 
nets. On the other hand, in this paper we focus on the 
behavioral clustering of generic malware samples based 
on structural similarities among their HTTP traffic traces, 
and on modeling the network behavior of the discovered 
malware families by extracting network-level malware 
detection signatures. 


3 HTTP-Based Behavioral Clustering 


The objective of our system is to find groups of mal- 
ware that interact with the Web in a similar way, learn 
a network behavior model for each group (or family) of 
malware, and then use such models to detect the pres- 
ence of malware-compromised machines in a monitored 
network. Towards this end, we first perform behavioral 
clustering of malware samples by finding structural sim- 
ilarities between the sequences of HTTP requests gen- 
erated as a consequence of infection. Namely, given a 
dataset of malware samples $M = \{m^{(i)}\}_{i=1..N}$, we ex-
ecute each sample $m^{(i)}$ in a controlled environment sim-
ilar to BotLab [20] for a time $T$, and we store its HTTP
Figure 1: Overview of our HTTP-based behavioral malware clustering system.


traffic trace $H(m^{(i)})$. We then partition $M$ into clusters
according to a definition of structural similarity among
the HTTP traffic traces $H(m^{(i)}), i = 1,..,N$.


3.1 System Overview 


To attain high-quality clusters and decrease the com- 
putational cost of clustering, we adopt a multistep 
cluster-refinement process, as shown in Figure 1: 


• Coarse-grained Clustering: In this phase, we clus-
ter malware samples based on simple statistical fea-
tures extracted from their malicious HTTP traffic. 
We measure features such as the total number of 
HTTP requests the malware generated, the number 
of GET and POST requests, the average length of 
the URLs, etc. Therefore, computing the distance 
between pairs of malware samples reduces to com- 
puting the distance between (short) vectors of num- 
bers, which can be done efficiently. 


• Fine-grained Clustering: After splitting the col-
lected malware set into relatively large (coarse-
grained) clusters, we further split each cluster into
smaller groups. To this end, we consider each 
coarse-grained cluster as a separate malware set, 
measure the structural similarity between the HTTP 
traffic generated by each sample in a cluster, and 
apply fine-grained clustering. This allows us to 
separate malware that have similar statistical traf- 
fic characteristics (thus causing them to fall in the 
same coarse-grained cluster), but that present dif- 
ferent structures of their HTTP queries. Measuring 
the structural similarity between pairs of HTTP traf- 
fic traces is relatively expensive. Since each coarse- 
grained cluster is much smaller than the total num- 
ber of samples N, fine-grained clustering can be 
done more efficiently than by applying it directly 
on the entire malware dataset. 


• Cluster Merging: The fine-grained clustering tends
to produce “tight” clusters of malware that have
very similar network behavior. However, one of
our objectives is to derive generic behavior models
that can be used to detect the network behavior of
a large number of current and future malware sam-
ples. To achieve this goal, after fine-grained cluster-
ing we perform a further refinement step in which
we try to merge together clusters of malware that 
have similar enough HTTP behavior, but that have 
been split by the fine-grained clustering process. In 
practice, given a set of fine-grained malware clus- 
ters, for each of them we define a cluster centroid 
as a set of network signatures that “summarize” the 
HTTP traffic generated by the malware samples in 
a cluster. We then measure the similarity between 
pairs of cluster centroids, and merge fine-grained 
clusters whose centroids are close to each other. 


The combination of coarse-grained and fine-grained 
clustering allows us to decrease the computational cost 
of the clustering process, compared to using only fine- 
grained clustering. Furthermore, the cluster merging pro- 
cess allows us to attain more generic network-level mal- 
ware signatures, thus increasing the malware detection 
rate (see Section 6.2). These observations motivate the 
use of our three-step clustering process. 

In all the three phases of our clustering system, we ap- 
ply single-linkage hierarchical clustering [19]. The main 
motivations for this choice are the fact that the hierar- 
chical clustering algorithm is able to find clusters of ar- 
bitrary shapes, and can work on arbitrary metric spaces 
(i.e., it is not limited to distance in the Euclidean space).
We ran pilot experiments using other clustering algo- 
rithms (e.g., X-means [27] for the coarse-grained cluster- 
ing, and complete-linkage hierarchical clustering [19]). 
The single-linkage hierarchical clustering performed the 
best, according to our analysis. 

The hierarchical clustering algorithm takes a matrix of 
pair-wise distances among objects as input and produces 
a dendrogram, i.e., a tree-like data structure where the
leaves represent the original objects, and the length of
the edges represents the distance between clusters [18].
Choosing the best clustering involves a cluster validity 
analysis to find the dendrogram cut that produces the 
most compact and well separated clusters. In order to 
automatically find the best dendrogram cut we apply the 
Davies-Bouldin (DB) cluster validity index [17]. We 
now describe our clustering system in more detail.
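The two ingredients described above, a single-linkage dendrogram cut and the Davies-Bouldin (DB) index used to choose it, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes objects are numeric vectors (so that cluster centroids exist for the DB index), it relies on the fact that cutting a single-linkage dendrogram at a given height is equivalent to taking connected components over all pairs at or below that distance, and the function names and candidate heights are hypothetical.

```python
from itertools import combinations


def single_linkage_cut(dist, height):
    """Clusters obtained by cutting a single-linkage dendrogram at `height`.

    Equivalent to the connected components of the graph that links every
    pair of objects whose distance is <= height (union-find below).
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= height:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def davies_bouldin(points, clusters):
    """Davies-Bouldin index of a partition (lower means more compact
    and better separated clusters). Assumes distinct cluster centroids."""
    dims = len(points[0])
    cents = [[sum(points[i][d] for i in c) / len(c) for d in range(dims)]
             for c in clusters]
    scatter = [sum(euclid(points[i], cents[k]) for i in c) / len(c)
               for k, c in enumerate(clusters)]
    k = len(clusters)
    return sum(
        max((scatter[a] + scatter[b]) / euclid(cents[a], cents[b])
            for b in range(k) if b != a)
        for a in range(k)
    ) / k


def best_cut(points, dist, heights):
    """Among candidate cut heights, keep the partition with the lowest DB index."""
    scored = []
    for h in heights:
        clusters = single_linkage_cut(dist, h)
        if 1 < len(clusters) < len(points):  # skip the two trivial partitions
            scored.append((davies_bouldin(points, clusters), h, clusters))
    return min(scored)[2] if scored else [list(range(len(points)))]
```

In practice the candidate heights would be the merge heights of the dendrogram itself; a fixed list is used here only to keep the sketch short.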
3.2 Coarse-grained Clustering


The goal of coarse-grained clustering is to find sim- 
ple statistical similarities in the way different malware 
samples interact with the Web. Let $M = \{m^{(i)}\}_{i=1..N}$
be a set of malware samples, and $H(m^{(i)})$ be the HTTP
traffic trace obtained by executing a malware $m^{(i)} \in M$
for a given time $T$. We translate each trace $H(m^{(i)})$ into
a pattern vector $v^{(i)}$ containing the following statistical
features to model how each malware uses the Web:

1. Total number of HTTP requests
2. Number of GET requests
3. Number of POST requests
4. Average length of the URLs
5. Average number of parameters in the request
6. Average amount of data sent by POST requests
7. Average response length
Because the ranges of the different features in the pattern vec-
tors are quite different, we first standardize the dataset so
that the features will have mean equal to zero and vari-
ance equal to one, and then we apply the Euclidean dis-
tance. We partition the set $M$ into coarse-grained clus-
ters by applying the single-linkage hierarchical cluster-
ing algorithm and DB index [17] cluster validity analy-
sis.
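As a minimal sketch of the standardization step, assuming each pattern vector lists the seven features in the order given above, the following code rescales every feature to zero mean and unit variance before computing Euclidean distances. The function names are illustrative, and the handling of constant features is an assumption the text does not address.

```python
from statistics import mean, pstdev


def standardize(vectors):
    """Rescale each feature (column) to mean 0 and variance 1."""
    columns = list(zip(*vectors))
    means = [mean(c) for c in columns]
    # Assumption: a constant feature carries no information, so it is
    # mapped to 0 instead of dividing by a zero standard deviation.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)]
            for v in vectors]


def euclidean(a, b):
    """Euclidean distance between two standardized pattern vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

The resulting pairwise distances form the matrix that is handed to the hierarchical clustering algorithm.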




3.3 Fine-grained Clustering


In the fine-grained clustering step, we consider the 
structural similarity among sequences of HTTP requests 
(as opposed to the statistical similarity used for coarse- 
grained clustering). Our objective is to group together 
malware that interact with Web applications in a simi- 
lar way. For example, we want to group together bots 
that rely on the same Web-based C&C application. Our 
approach is based on the observation that two different 
malware samples that rely on the same Web server appli- 
cation will query URLs structured in a similar way, and 
in a similar sequence. In order to capture these similar- 
ities, we first define a measure of distance between two 
HTTP requests $r_k$ and $r_h$, generated by two different mal-
ware samples. Consider Figure 2, where m, p, n, and v
represent different parts of an HTTP request:

• m represents the request method (e.g., GET, POST,
HEAD, etc.). We define a distance function
$d_m(r_k, r_h)$ that is equal to 0 if the requests $r_k$ and
$r_h$ both use the same method (e.g., both are GET
requests), otherwise it is equal to 1.

• p stands for page, namely the first part of the URL
that includes the path and page name, but does not
include the parameters. We define $d_p(r_k, r_h)$ to be
equal to the normalized Levenshtein distance¹ be-
tween the strings related to the path and pages that
appear in the two requests $r_k$ and $r_h$.

• n represents the set of parameter names (i.e., n =
{id, version, cc} in the example in Figure 2). We
define $d_n(r_k, r_h)$ as the Jaccard distance² between
the sets of parameter names in the two requests.

• v is the set of parameter values. We define
$d_v(r_k, r_h)$ to be equal to the normalized Leven-
shtein distance between strings obtained by con-
catenating the parameter values (e.g., 0011.0US).


¹The normalized Levenshtein distance between two strings s1 and
s2 (also known as edit distance) is equal to the minimum number of
character operations (insert, delete, or replace) needed to transform one
string into the other, divided by max(length(s1),length(s2)).

Figure 2: Structure of an HTTP request used in fine-
grained clustering. m=Method; p=Page; n=Parameter Names;
v=Parameter Values.

We define the overall distance between two HTTP re- 
quests as 


$$d_r(r_k, r_h) = w_m \cdot d_m(r_k, r_h) + w_p \cdot d_p(r_k, r_h) + w_n \cdot d_n(r_k, r_h) + w_v \cdot d_v(r_k, r_h) \qquad (1)$$


where the factors $w_x$, $x \in \{m,p,n,v\}$, are predefined
weights (the actual values assigned to the weights $w_x$ are
discussed in Section 6) that give more importance to the
distance between the request methods and pages, for ex- 
ample, and less weight to the distance between parameter 
values. We then define the fine-grain distance between 
two malware samples as the average minimum distance 
between sequences of HTTP requests from the two sam- 
ples, and apply the single-linkage hierarchical cluster- 
ing algorithm and the DB cluster validity index [17] to 
split each coarse-grained cluster into fine-grained clus- 
ters (we only split coarse-grained clusters whose diame-
ter is larger than a predefined threshold $\theta = 0.1$).
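The request distance of Equation 1 and the sample-level distance built on it can be sketched as follows. Several details are assumptions made for illustration only: the weight values are placeholders (the actual values are deferred to Section 6), requests are given as "METHOD URL" strings, and "average minimum distance" is read here as the mean of the two directed averages, which is one plausible interpretation rather than the confirmed definition.

```python
from urllib.parse import urlsplit, parse_qsl

# Placeholder weights: methods and pages count more than parameter values.
WEIGHTS = {"m": 10.0, "p": 8.0, "n": 3.0, "v": 1.0}


def levenshtein(a, b):
    """Minimum number of insertions, deletions, or replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # replace or keep
        prev = cur
    return prev[-1]


def norm_lev(a, b):
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def parse_request(line):
    """Split 'METHOD URL' into (method, page, [(name, value), ...])."""
    method, url = line.split(" ", 1)
    parts = urlsplit(url)
    return method, parts.path, parse_qsl(parts.query, keep_blank_values=True)


def request_distance(a, b, w=WEIGHTS):
    """Weighted sum of the four component distances (Equation 1)."""
    (ma, pa, qa), (mb, pb, qb) = parse_request(a), parse_request(b)
    d_m = 0.0 if ma == mb else 1.0
    d_p = norm_lev(pa, pb)
    d_n = jaccard({k for k, _ in qa}, {k for k, _ in qb})
    d_v = norm_lev("".join(v for _, v in qa), "".join(v for _, v in qb))
    return w["m"] * d_m + w["p"] * d_p + w["n"] * d_n + w["v"] * d_v


def sample_distance(trace_a, trace_b):
    """Average minimum request distance, symmetrized over both traces."""
    def directed(xs, ys):
        return sum(min(request_distance(x, y) for y in ys) for x in xs) / len(xs)
    return (directed(trace_a, trace_b) + directed(trace_b, trace_a)) / 2
```

Two requests that differ only in one character of a parameter value are therefore very close, while a change of method or page dominates the distance.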


3.4 Cluster Merging 


Fine-grained clustering tends to produce tight clusters, 
which yield specific malware signatures. However, our 
objective is to derive generic malware signatures which 
can be used to detect as many future malware variants 
as possible, while maintaining a very low false positive 
rate. Towards this end, we apply a further refinement 
step in which we merge together fine-grained clusters of 
malware variants that behave similarly enough, in terms 
of the HTTP traffic they generate. For each fine-grained 
malware cluster we compute a cluster centroid, which 
summarizes the HTTP requests performed by the mal- 
ware samples in a cluster, and then we define a measure 
of distance among centroids (and therefore among clus- 
ters). The cluster merging phase is a meta-clustering step 
in which we find groups of malware clusters that are very 


²The Jaccard distance between two sets A and B is defined as
$J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$.
a) GET /.*/command\.php\?id=1\..*&version=.*&cc=.*
b) GET //command.php?id=1.&version=&cc=
Figure 3: Example of network signature (a), and its plain text
version (b).


close to each other, and we merge them to form bigger 
clusters. 


Cluster Centroids Let $C_i$ be a cluster of malware
samples, and $H_i = \{H(m_i^{(k)})\}_{k=1..c_i}$ the related set of
HTTP traffic traces obtained by executing each malware
sample in $C_i$. We define the centroid of $C_i$ as a set
$S_i = \{s_j\}_{j=1..l_i}$ of network signatures. Each signature
$s_j$ is extracted from a pool $p_j$ of HTTP requests selected
from the traffic traces in $H_i$. We first describe the algo-
rithm used for creating the set of HTTP request pools,
and then we describe how the signatures are extracted
from the obtained pools.

To create a set $P_i$ of request pools, we first randomly
select one of the malware samples in cluster $C_i$ to be our
centroid seed. Assume we pick $m_i^{(h)}$ for this purpose. We
then consider the set of HTTP requests in the HTTP traf-
fic trace $H(m_i^{(h)}) = \{r_j\}_{j=1..l_i}$. We initialize the pool
set $P_i$ by putting each request $r_j$ in a different (until now
empty) pool $p_j$. Now, using the definition of distance
between HTTP requests in Equation 1, for each request
$r_j \in H(m_i^{(h)})$ we find the closest request $r'_k \in H(m_i^{(g)})$
from another malware sample $m_i^{(g)} \in C_i$, and we add
$r'_k$ to the pool $p_j$. We repeat this for all the malware
$m_i^{(g)} \in C_i, g \neq h$. After this process is complete, and
pool $p_j$ has been filled with HTTP requests, we reiterate
the same process to construct pool $p_{j' \neq j}$ starting from re-
quest $r_{j'} \in H(m_i^{(h)})$, until all pools $p_j, j = 1,..,l_i$ have
been filled.
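The pool construction just described can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: `traces` holds one list of requests per malware sample in the cluster, `distance` stands in for the request distance of Equation 1, and the optional `seed_index` parameter is a hypothetical addition that makes the random seed choice reproducible.

```python
import random


def build_pools(traces, distance, seed_index=None):
    """Build one request pool per request of a randomly chosen seed sample.

    Each pool starts with one seed request and receives, from every other
    sample in the cluster, the request closest to that seed request.
    """
    if seed_index is None:
        seed_index = random.randrange(len(traces))  # random centroid seed
    pools = [[request] for request in traces[seed_index]]
    for g, trace in enumerate(traces):
        if g == seed_index or not trace:
            continue  # skip the seed itself and samples with no requests
        for pool in pools:
            seed_request = pool[0]
            closest = min(trace, key=lambda r: distance(seed_request, r))
            pool.append(closest)
    return pools
```

Each resulting pool then contains requests that play the same role across the samples of the cluster, which is what makes it a suitable input for signature extraction.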

After the pools have been filled with HTTP requests, 
we extract a signature $s_j$ from each pool $p_j \in P_i$ using
the Token-Subsequences algorithm implemented in [24] 
(Token-Subsequences signatures can be easily translated 
into Snort signatures). Since the signature generation al- 
gorithm itself is not a contribution of this paper, we refer 
the reader to [24] for more details on how the Token Sub- 
sequences signatures are generated. Here it is sufficient 
to notice that a Token Subsequences signature is an or-
dered list of invariant tokens, i.e., substrings that are in
common to all the requests in a request pool $p$. There-
fore, a signature $s_j$ can be written as a regular expres-
sion of the kind t1.*t2.*...*tn, where the t’s are
invariant tokens that are common to all the requests in
the pool $p_j$. We consider only the first part of each HTTP
request for signature generation purposes, namely, the re- 
quest method and URL (see Figure 3a). 
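To make the signature format concrete, the sketch below turns an ordered list of invariant tokens into a regular expression of the kind t1.*t2.*...*tn, and into its plain text version. It does not implement the Token-Subsequences algorithm itself; the token list is taken from the example in Figure 3, and the helper names are hypothetical.

```python
import re


def signature_regex(tokens):
    """Join escaped invariant tokens with '.*' wildcards, in order."""
    return ".*".join(re.escape(token) for token in tokens)


def plain_text(tokens):
    """Plain text version of a signature: the tokens concatenated."""
    return "".join(tokens)


# Invariant tokens corresponding to the signature shown in Figure 3.
TOKENS = ["GET /", "/command.php?id=1.", "&version=", "&cc="]
```

A request such as `GET /bot/command.php?id=1.7&version=2.0&cc=US` matches the resulting expression under `re.search`, while an unrelated request such as `GET /index.html` does not.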


Meta-Clustering After a centroid has been computed
for each fine-grained cluster, we can compute the dis-
tance between pairs of centroids $d(S_i, S_j)$. We first de-
fine the distance between pairs of signatures, and then
we extend this definition to consider sets of signatures.
Let $s_i$ be a signature, and $s'_j$ be a plain text concate-
nation of the invariant tokens in signature $s_j$. For ex-
ample, t1t2t3 is a plain text version of the signature
t1.*t2.*t3 (see Figure 3 for a concrete example).
We define the distance between two signatures as 


$$d(s_i, s_j) = \frac{\mathrm{agrep}(s_i, s'_j)}{\mathrm{length}(s'_i)} \in [0, 1]$$

where agrep($s_i$, $s'_j$) is a function that performs approxi-
mate matching of regular expressions [31] of the signature
$s_i$ on the string $s'_j$, and returns the number of matching
errors. In practice, $d(s_i, s_j)$ is equal to zero when $s_i$ per-
fectly “covers” (i.e., is more generic than) $s_j$, and tends
to one when signatures $s_i$ and $s_j$ are more and more dif-
ferent.

Given the above definition of distance between signa- 
tures, we define the distance between two centroids (i.e., 
two clusters) as the minimum average distance between 
two sets of signatures³. It is worth noting that when
computing the distance between two centroids, we only
consider those signatures $s_k$ for which length($s'_k$) > $\lambda$.
Here $s'_k$ is again the plain text version of $s_k$, length($s'_k$)
is the length of the string $s'_k$, and $\lambda$ is a predefined length
threshold. The threshold $\lambda$ is chosen to avoid apply-
ing the agrep function on short, and therefore likely
too generic, signatures that would match most HTTP re-
quests (e.g., $s_k$ = GET /.*), thus artificially skewing
the distance value towards zero.

We then apply again the hierarchical clustering algo- 
rithm in combination with the DB validity index [17] to 
find groups of clusters (or meta-clusters) that are close to 
each other and should therefore be merged. 
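The centroid distance, i.e., the minimum average distance between two signature sets restricted to signatures longer than the threshold, can be sketched as follows. The signature distance is passed in as a function, since the agrep-based approximate regular expression matching is not reproduced here; `plain_len`, `lam`, and the fallback value for empty sets are assumptions of this sketch.

```python
def centroid_distance(sigs_a, sigs_b, sig_distance, plain_len, lam):
    """Minimum of the two directed average-minimum signature distances.

    sig_distance(s, t): distance between two signatures, in [0, 1]
    plain_len(s):       length of the plain text version of signature s
    lam:                length threshold below which signatures are ignored
    """
    a = [s for s in sigs_a if plain_len(s) > lam]
    b = [s for s in sigs_b if plain_len(s) > lam]
    if not a or not b:
        # Assumption: with no usable signatures, treat centroids as far apart.
        return 1.0

    def average_min(xs, ys):
        return sum(min(sig_distance(x, y) for y in ys) for x in xs) / len(xs)

    return min(average_min(a, b), average_min(b, a))
```

Taking the minimum of the two directions means that a cluster whose signatures are all covered by those of a larger cluster is considered close to it, even if the larger cluster has additional signatures.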


4 Network Signatures 


The cluster-merging step described in Section 3.4 rep- 
resents the last phase of our behavioral clustering pro- 
cess, and its output represents the final partitioning of 
the original malware set $M = \{m^{(i)}\}_{i=1..N}$ into groups
of malware that share similar HTTP behavior. Now, for
each of the final output clusters $C'_i, i = 1,..,c$, we can
compute an “updated” centroid signature set $S'_i$ using the
same algorithm described in Section 3.4 for computing
cluster centroids. The signature set $S'_i$ can then be de-
ployed into an IDS at the edge of a network in order to
detect malicious HTTP requests, which are a symptom 
of malware infection. 

³Formally, $d(S_i, S_j) = \min\left\{ \frac{1}{l_i} \sum_{i} \min_{j} \{d(s_i, s_j)\},\ \frac{1}{l_j} \sum_{j} \min_{i} \{d(s_j, s_i)\} \right\}$

It is important to notice that some malware samples
may contact legitimate websites for malicious purposes.
For example, some botnets use facebook or twitter
for C&C [3]. To decrease the possibility of false pos- 
itives, one may be tempted to prefilter all the HTTP 
requests sent by malware samples against well known, 
legitimate websites before generating the network sig- 
natures. However, prefiltering all the HTTP requests 
against these websites may not be a good idea because 
we may discard HTTP requests that, although “served” 
by legitimate websites, are specific to certain malware 
families and whose related network signatures may yield 
a high detection rate with low false positives. To solve 
this problem, instead of prefiltering HTTP traffic against 
legitimate websites, we apply a signature pruning pro-
cess by testing the signature set $S'_i$ against a large dataset
of legitimate traffic and discard the signatures that gen- 
erate false positives. 
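The pruning step can be sketched in a few lines, assuming signatures are regular expressions over the request method and URL and that the legitimate traffic is available as a list of request strings. The function name and the sample requests are illustrative.

```python
import re


def prune_signatures(signatures, legit_requests):
    """Keep only the signatures that match no legitimate request."""
    kept = []
    for signature in signatures:
        pattern = re.compile(signature)
        if not any(pattern.search(request) for request in legit_requests):
            kept.append(signature)  # no false positive observed
    return kept
```

For instance, an overly generic signature such as `GET /.*` would be discarded as soon as the legitimate dataset contains any GET request, while the signature of Figure 3a would survive unless a legitimate request shares its specific structure.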


5 Cluster Validity Analysis 


Clustering can be viewed as an unsupervised learning 
task, and analyzing the validity of the clustering results is 
intrinsically hard. Cluster validity analysis often involves 
the use of a subjective criterion of optimality [19], which 
is specific to a particular application. Therefore, no stan- 
dard way exists of validating the output of a clustering 
procedure [19]. As discussed in Section 3, we make use 
of the DB validity index [17] in all the phases of our mal- 
ware clustering process to automatically choose the best 
possible partitioning of the malware dataset. However, 
it is also desirable to analyze the clustering results by 
quantifying the level of agreement between the obtained 
clusters and the information about the clustered malware 
samples given by different AV vendors, for example. 

Bayer et al. [10] proposed to use precision and recall 
(which are widely used in text classification problems, 
for example, but not as often for cluster validity analy- 
sis) to compare the results of their system-level behav- 
ioral clustering system to a reference clustering. Gen- 
erating such reference clustering is not easy because the 
labels assigned by different AV scanners to variants of 
the same malware family are seldom consistent. This 
required Bayer et al. [10] to define a mapping between 
labels assigned by different AVs. 

We propose a new approach to analyze the validity of 
malware clustering results, which does not require any 
manual mapping of AV labels. Our approach is based on 
a measure of the cohesion (or compactness) of each clus- 
ter, and the separation among different clusters. We mea- 
sure both cohesion and separation in terms of the agree- 
ment between the labels assigned to the malware samples 
in a cluster by multiple AV scanners. It is worth noting, 
though, that since the AV labels themselves are not al- 
ways consistent (as observed in [9, 10]), our measures 
of cluster cohesion and separation give us an indication 
of the validity of the clustering results, rather than being
an oracle. However, we devised our cluster cohesion and
separation indices to mitigate possible inconsistencies among
AV labels.


AV Label Graphs Before describing how cluster co- 
hesion and separation are measured, we need to intro- 
duce the notion of AV label graph. We introduce AV 
label graphs to mitigate the effects of the inconsistency 
of AV labels, and to map the problem of measuring the 
cohesion (or compactness) and separation of clusters in 
terms of easier-to-handle graph-based indices. We first 
start with an example to show how to construct the AV 
label graph given a cluster of malware samples. We then 
provide a more formal definition. 


Consider the example of malware cluster in Figure 4a, 
which contains eight malware samples (one per line). 
Each line reports the MD5 hash of a malware sample, 
and the AV labels assigned to the sample by three dif- 
ferent AV scanners (McAfee [4], Avira [1], and Trend 
Micro [7]). From this malware cluster we construct an 
AV label graph as follows: 


1. Create a node in the graph for each distinct AV
malware family label (we identify a malware family
label by extracting the first AV label substring
that ends with a “.” character). For example
(see Figure 4b), the first malware sample is
classified as belonging to the W32/Virut family by
McAfee, WORM/Rbot by Avira, and PE_VIRUT
by Trend Micro. Therefore we create three
nodes in the graph called McAfee_W32_Virut,
Avira_WORM_Rbot, and Trend_PE_VIRUT (in
case a malware sample is not detected by an AV
scanner, we map it to a special null label).

2. Once all the nodes have been created, we connect
them using weighted edges. We connect two nodes
with an edge only if the related two malware family
labels (i.e., the names of the nodes) appear together
in at least one of the lines in Figure 4a.

3. A weight equal to $1 - m/n$ is assigned to each edge,
where m represents the number of times the two
malware family labels connected by the edge have
appeared on the same line in the cluster (i.e., for the
same malware sample), and n is the total number of
samples in the cluster (n = 8 in this example).


As we can see from Figure 4b, the nodes McAfee_W32_Virut
and Trend_PE_VIRUT are connected by an edge with weight
equal to zero because both McAfee and Trend Micro
consistently classify each malware sample in the cluster as
W32/Virut and PE_VIRUT, respectively (i.e., m = n). On the
other hand, the edge between nodes McAfee_W32_Virut and
Avira_W32_Virut, for example, was assigned a weight equal to
0.625 because in this case m = 3, so that 1 - 3/8 = 0.625. We
now define AV label graphs more formally.
Figure 4: Example of Malware Cluster (a) and related AV Label Graph (b). Each malware sample (identified by its MD5 hash) is labeled
using three different AV scanners, namely McAfee (m), Avira (a), and Trend Micro (t).


Definition 1 - AV Label Graph. An AV label graph
is an undirected weighted graph. Given a malware
cluster $C_i = \{m_k^{(i)}\}_{k=1..c_i}$, let
$\Gamma_i = \{L_1 = (l_1,..,l_v)_1, .., L_{c_i} = (l_1,..,l_v)_{c_i}\}$
be a set of label vectors, where label vector
$L_h = (l_1,..,l_v)_h$ is the set of malware family labels
assigned by v different AV scanners to malware
$m_h^{(i)} \in C_i$. The AV label graph
$G_i = \{V_k^{(i)}, E_{k_1,k_2}^{(i)}\}_{k=1..l}$ is constructed
by adding a node $V_k^{(i)}$ for each distinct malware family
label $l_k \in \Gamma_i$. Two nodes $V_{k_1}^{(i)}$ and
$V_{k_2}^{(i)}$ are connected by a weighted edge
$E_{k_1,k_2}^{(i)}$ only if the malware family labels
$l_{k_1}$ and $l_{k_2}$ related to the two nodes appear at
least once in the same label vector $L_h \in \Gamma_i$. Each
edge $E_{k_1,k_2}^{(i)}$ is assigned a weight
$w = 1 - m/c_i$, where m is equal to the number of label
vectors $L_h \in \Gamma_i$ containing both $l_{k_1}$ and
$l_{k_2}$, and $c_i$ is the number of malware samples in $C_i$.
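
The construction in Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the scanner keys, and the raw labels in the 8-sample cluster are hypothetical, and the family label is assumed to be the part of an AV label that precedes the first "." character.

```python
from collections import Counter
from itertools import combinations


def node_name(scanner, label):
    """Node name for one AV label: scanner name plus the family prefix,
    i.e. the label text before the first '.' (None maps to 'null')."""
    if label is None:
        return f"{scanner}_null"
    return f"{scanner}_{label.split('.', 1)[0].replace('/', '_')}"


def av_label_graph(label_vectors):
    """Build the AV label graph of one cluster.

    label_vectors holds one dict per malware sample, mapping a scanner
    name to the raw label it assigned. Returns (nodes, edges), where
    edges[frozenset({a, b})] is the weight 1 - m/c_i."""
    c_i = len(label_vectors)
    nodes, together = set(), Counter()
    for vector in label_vectors:
        names = sorted({node_name(s, l) for s, l in vector.items()})
        nodes.update(names)
        # Count how often two family labels describe the same sample.
        for a, b in combinations(names, 2):
            together[frozenset((a, b))] += 1
    edges = {pair: 1 - m / c_i for pair, m in together.items()}
    return nodes, edges


# Hypothetical 8-sample cluster: McAfee and Trend Micro agree on every
# sample, while Avira reports Rbot for five samples and Virut for three.
cluster = [
    {"McAfee": "W32/Virut.gen", "Avira": avira, "Trend": "PE_VIRUT.D"}
    for avira in ["WORM/Rbot.A"] * 5 + ["W32/Virut.Gen"] * 3
]
nodes, edges = av_label_graph(cluster)
```

With this input the edge between McAfee_W32_Virut and Trend_PE_VIRUT gets weight 0, and the edge between McAfee_W32_Virut and Avira_W32_Virut gets weight 1 - 3/8 = 0.625, matching the example discussed for Figure 4b.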


Cluster Cohesion and Separation Now that we have 
defined AV label graphs, we can formally define cluster 
cohesion and separation in terms of AV labels. 


Definition 2 - Cohesion Index. Given a cluster $C_i$, let
$G_i = \{V_k^{(i)}, E_{k_1,k_2}^{(i)}\}_{k=1..l}$ be its AV
label graph, and $\delta_{l_1,l_2}$ be the shortest path
between two nodes $V_{l_1}^{(i)}$ and $V_{l_2}^{(i)}$ in
$G_i$. If no path exists between the two nodes, the distance
$\delta_{l_1,l_2}$ is assumed to be equal to a constant “gap”
$\gamma > \sup(w_{k_1,k_2})$, where $w_{k_1,k_2}$ is the
weight of a generic edge $E_{k_1,k_2}^{(i)} \in G_i$. The
cohesion index of cluster $C_i$ is defined as

$$C(C_i) = 1 - \frac{1}{\gamma} \cdot \frac{2}{n \cdot v\,(n \cdot v - 1)} \sum_{l_1 < l_2} \delta_{l_1,l_2}$$

where n is the number of malware samples in the cluster,
and v is the number of different AV scanners.


According to our definition of AV label graph,
$\sup(w_{k_1,k_2}) = 1$, and we set $\gamma = 10$. In
practice, the cohesion index $C(C_i) \in [0,1]$ will be equal
to one when each AV scanner consistently assigns the same
malware family label to each of the malware samples in
cluster $C_i$. On the other hand, the cohesion index will
tend to zero if each AV scanner assigns different malware
family labels to each of the malware samples in the cluster.
For example, the graph in Figure 4b has a cohesion index
equal to 0.999. The cohesion index is very high thanks to the
fact that both McAfee and Trend Micro consistently assign the
same family label to all the samples in the cluster. If Avira
also consistently assigned the same family label to all the
samples (either always Avira_W32_Virut or always
Avira_WORM_Rbot), the cohesion index would be equal to one.
As we can see, regardless of the inconsistency in Avira’s
labels, thanks to the fact that we use multiple AV scanners
and we leverage the notion of AV label graphs, we can
correctly consider the cluster in Figure 4a as very compact,
thus confirming the validity of the behavioral clustering
process.
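
As a worked check of the 0.999 value, the following sketch computes a cohesion index for a graph like the one in Figure 4b. It rests on stated assumptions rather than on the authors' code: the index is taken to be one minus the sum of pairwise shortest-path distances, normalized by the gap (10) and by the number of label pairs n·v(n·v - 1)/2, and Avira is assumed to label five samples as Rbot and three as Virut. The function names are hypothetical.

```python
from itertools import combinations

GAMMA = 10.0  # "gap" distance assigned to node pairs with no connecting path


def shortest_paths(nodes, edges, gap=GAMMA):
    """All-pairs shortest paths (Floyd-Warshall). Unreachable pairs, and
    any pair whose shortest path exceeds the gap, get distance `gap`."""
    dist = {a: {b: 0.0 if a == b else gap for b in nodes} for a in nodes}
    for (a, b), weight in edges.items():
        dist[a][b] = dist[b][a] = weight
    for k in nodes:
        for a in nodes:
            for b in nodes:
                dist[a][b] = min(dist[a][b], dist[a][k] + dist[k][b])
    return dist


def cohesion(nodes, edges, n, v, gap=GAMMA):
    """Cohesion of a cluster with n samples labeled by v AV scanners."""
    dist = shortest_paths(nodes, edges, gap)
    total = sum(dist[a][b] for a, b in combinations(sorted(nodes), 2))
    label_pairs = n * v * (n * v - 1) / 2
    return 1.0 - total / (gap * label_pairs)


nodes = {"McAfee_W32_Virut", "Trend_PE_VIRUT",
         "Avira_WORM_Rbot", "Avira_W32_Virut"}
edges = {  # weights 1 - m/n with n = 8
    ("McAfee_W32_Virut", "Trend_PE_VIRUT"): 0.0,     # m = 8
    ("McAfee_W32_Virut", "Avira_WORM_Rbot"): 0.375,  # m = 5 (assumed)
    ("Trend_PE_VIRUT", "Avira_WORM_Rbot"): 0.375,    # m = 5 (assumed)
    ("McAfee_W32_Virut", "Avira_W32_Virut"): 0.625,  # m = 3
    ("Trend_PE_VIRUT", "Avira_W32_Virut"): 0.625,    # m = 3
}
example = cohesion(nodes, edges, n=8, v=3)
```

Under these assumptions the pairwise distances sum to 3.0, so the index is 1 - 3/(10 · 276), which rounds to 0.999.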


Definition 3 - Separation Index. Given two clusters $C_i$
and $C_j$ and their respective label graphs $G_i$ and $G_j$,
let $C_{ij}$ be the cluster obtained by merging $C_i$ and
$C_j$, and $G_{ij}$ be its label graph. By definition,
$G_{ij}$ will contain all the nodes $V_k^{(i)} \in G_i$ and
$V_h^{(j)} \in G_j$. The separation index between $C_i$ and
$C_j$ is defined as

$$S(C_i, C_j) = \frac{1}{\gamma}\, \mathrm{avg}_{k,h} \{\Delta(V_k^{(i)}, V_h^{(j)})\}$$

where $\Delta(V_k^{(i)}, V_h^{(j)})$ is the shortest path in
$G_{ij}$ between nodes $V_k^{(i)}$ and $V_h^{(j)}$, and
$\gamma$ is the “gap” introduced in Definition 2.


In practice, the separation index takes values in the
interval [0,1]. $S(C_i, C_j)$ will be equal to zero if the
malware samples in clusters $C_i$ and $C_j$ are all
consistently labeled by each AV scanner as belonging to the
same malware family. Higher values of the separation index
indicate that the malware samples in $C_i$ and $C_j$ are more
and more diverse in terms of malware family labels, and are
perfectly separated (i.e., $S(C_i, C_j) = 1$) when no
intersection exists between the malware family labels
assigned to malware samples in $C_i$ and the ones assigned to
malware samples in $C_j$.
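
The two extreme cases described above can be reproduced with a small sketch. It assumes that the separation index is the average shortest-path distance in the merged label graph between the nodes of one cluster and the nodes of the other, divided by the gap (10). The helper names and the sample family labels are hypothetical, and each label vector is written directly as a tuple of node names.

```python
from itertools import combinations

GAMMA = 10.0  # "gap" distance for node pairs with no connecting path


def label_graph(vectors):
    """Nodes and edge weights (1 - m/n) for a list of label vectors."""
    n = len(vectors)
    counts = {}
    for vec in vectors:
        for a, b in combinations(sorted(set(vec)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    nodes = {label for vec in vectors for label in vec}
    return nodes, {pair: 1 - m / n for pair, m in counts.items()}


def shortest_paths(nodes, edges, gap=GAMMA):
    """All-pairs shortest paths; unreachable pairs keep distance `gap`."""
    dist = {a: {b: 0.0 if a == b else gap for b in nodes} for a in nodes}
    for (a, b), weight in edges.items():
        dist[a][b] = dist[b][a] = weight
    for k in nodes:
        for a in nodes:
            for b in nodes:
                dist[a][b] = min(dist[a][b], dist[a][k] + dist[k][b])
    return dist


def separation(cluster_i, cluster_j, gap=GAMMA):
    """Average distance, in the merged graph, between the nodes of the
    two clusters, scaled by the gap so that it falls in [0, 1]."""
    nodes, edges = label_graph(cluster_i + cluster_j)
    dist = shortest_paths(nodes, edges, gap)
    nodes_i = {label for vec in cluster_i for label in vec}
    nodes_j = {label for vec in cluster_j for label in vec}
    total = sum(dist[a][b] for a in nodes_i for b in nodes_j)
    return total / (gap * len(nodes_i) * len(nodes_j))


virut = ("McAfee_W32_Virut", "Avira_W32_Virut", "Trend_PE_VIRUT")
other = ("McAfee_Generic", "Avira_TR_Dropper", "Trend_TROJ_AGENT")
same_family = separation([virut] * 4, [virut] * 6)
disjoint = separation([virut] * 4, [other] * 5)
```

Two clusters that every scanner labels identically give 0, and two clusters with no label in common give 1.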


6 Experiments 


In this section we present our experimental results. 
           Undetected by     Undetected by     Processing Time
Dataset    all AVs           best AV           coarse    fine     meta+sig
Feb09      208 (4.4%)        327 (6.9%)        34min     22min    6h55min
Mar09      252 (7.1%)        302 (8.6%)        19min     3min     1h3min
Apr09      142 (6.2%)        175 (7.7%)        8min      5min     28min
May09      997 (20.5%)       1,127 (23.2%)     56min     8min     2h52min
Jun09      1,038 (22.2%)     1,164 (24.9%)     57min     3min     37min
Jul09      1,569 (28.1%)     1,665 (29.8%)     1h55min   5min     2h22min

[Columns whose values are missing here: samples; Number of Clusters (coarse, fine, meta).]

Table 1: Summary of Clustering Results (column meta+sig includes the meta-clustering and signature extraction processing time).


6.1 HTTP-Based Behavioral Clustering 


Malware Dataset Our malware dataset consists of 25,720
distinct (no duplicates) malware samples, each of which
generates at least one HTTP request when executed on a victim
machine. We collected our malware samples over a period of
six months, from February to July 2009, from a number of
different malware sources such as MWCollect [2], Malfease
[5], and commercial malware feeds. Table 1 (first and second
column) shows the number of distinct malware samples
collected in each month. Similar to previous works that rely
on an analysis of malware behavior [9, 10, 20], we executed
each sample in a controlled environment for a period T = 5
minutes, during which we recorded the HTTP traffic to be used
for our behavioral clustering (see Section 3).


To perform cluster analysis based on AV labels, as de- 
scribed in Section 5, we scanned each malware sample 
with three commercial AV scanners, namely McAfee [4], 
Avira [1], and Trend Micro [7]. As we can see from 
Table 1 (third and fourth column), each of our datasets 
contains a number of malware samples which are not de- 
tected by any of our AV scanners. In addition, the num- 
ber of undetected samples grew significantly during the 
last few months, for both the combination of the three 
scanners, and for the single best AV (i.e., the AV scanner
that overall detected the highest number of samples). This is
explained by the fact that we scanned all the binaries in
August 2009 using the most recent AV signatures.
Therefore, AV companies had enough time to generate 
signatures for most malware collected in February, for 
example, but evidently not enough time to generate sig- 
natures for many of the more recent malware samples. 
Given the rapid pace at which new malware samples 
are created [30], and since it may take months for AV 
vendors to collect a specific malware variant and gener- 
ate traditional detection signatures for it, this result was 
somewhat expected and is in accordance with the results 
reported by Oberheide et al. [26]. 


Experimental Setup We implemented a proof-of-concept version
of our behavioral clustering system (see Section 3), which
consists of a little over 2,000 lines of Java code. We set
the weights defined in Equation 1 (as explained in Section
3.3) to $w_m = 10$, $w_p = 8$, $w_n = 3$, and $w_v = 1$. We
set the minimum signature length $\lambda$ used to compute
the distance between cluster centroids (see Section 3.4) to
10. To perform fine- and meta-clustering (see Section 3.4),
we considered the first 10 HTTP requests generated by each
malware sample during execution. We performed approximate
matching of regular expressions (see agrep function in
Section 3.4) using the TRE library [22]. All the experiments
were performed on a 4-core 2.67GHz Intel Core i7 machine with
12GB of RAM, though we never used more than 2 cores and 8GB
of RAM for each experiment run.


Clustering Results We applied our behavioral clustering
algorithm to the malware samples collected in each of the six
months of observation. Table 1 summarizes our clustering
results, and reports the number of clusters produced by each
of the clustering refinement steps, i.e., coarse-grained,
fine-grained, and meta-clustering (see Section 3). For
example, in February 2009, we collected 4,758 distinct
malware samples. The coarse-grained clustering step grouped
them into 2,538 clusters, the fine-grained clustering further
split some of these clusters to generate a total of 2,660
clusters, and the meta-clustering process found that some of
the fine-grained clusters could be merged to produce a final
number of 1,499 (meta-)clusters. Table 1 also reports the
time needed to complete each step of our clustering process.
The most expensive step is almost always the meta-clustering
(see Section 3.4) because measuring the distance between
centroids requires using the agrep function for approximate
matching of regular expressions, which is relatively
expensive to compute. However, computing the clusters for one
month of HTTP-based malware takes only a few hours. The
variability in clustering time is due to the different number
of samples per month, and to the different amount of HTTP
traffic they generated during execution. Further
optimizations of our clustering system are left as future
work.

Table 7 (first and second row) shows, for each month, the
number of clusters and the clustering time obtained by
directly applying the fine-grained clustering step alone to
our malware datasets (we will explain the meaning of the last
row of Table 7 later in Section 6.2). We can see from Table 1
that the combination of coarse-grained and fine-grained
clustering requires a lower computation time, compared to
applying fine-grained clustering by itself. For example,
according to Table 1, computing the coarse-grained clusters
first and then refining the results using fine-grained
clustering on the Feb09 dataset takes 56 minutes. On the
other hand, according to Table 7, applying fine-grained
clustering directly on Feb09 requires more than 4 hours.
Furthermore, although applying fine-grained clustering by
itself requires less time than our three-step clustering
approach (which includes meta-clustering) for three out of
six datasets, our three-step clustering yields better
signatures and a higher malware detection rate in all cases,
as we discuss in Section 6.2.

USENIX Association

[Figure 5 panels: AV-label cohesion (top) and AV-label separation (bottom).]
Figure 5: Distribution of cluster cohesion and separation (Feb09).

[Figure 6 panels, one per sample:
 MD5 7ee251d8d13ed32914a4e39740b91ae2, AV Labels = DR/PCK.Tdss.A.21 [Avira]
 MD5 076b81e8c6622e9c6a94426e8c2dfe33, AV Labels = Generic FakeAlert.h [McAfee]; TR/Dropper.Gen [Avira]
 HTTP Traffic (each sample, to 94.247.2.193:80):
   POST /cgi-bin/generator HTTP/1.0, Content-Length: 45
   POST /extra.php HTTP/1.0, Content-Type: application/x-www-form-urlencoded, Content-Length: 44
 File System Operations: a separate list of Delete, Write, and Read events for each sample.]


To analyze the quality of the final clusters generated 
by our system we make use of the cluster cohesion and 
separation defined in Section 5. Figure 5 shows a his- 
togram of the cohesion index values (top graph) com- 
puted for each of the clusters obtained from the Feb09 
malware dataset, and the distribution of the separation 
among clusters (bottom graph). Because of space limi- 
tations, we only discuss the cohesion and separation re- 
sults from Feb09. The cohesion histogram only consid- 
ers clusters that contain two or more malware samples 
(clusters containing only one sample have cohesion equal 
to 1 by definition). Ideally, we would like the value of 
cohesion for each cluster to be as close as possible to 
1. Figure 5 confirms the effectiveness of our clustering 
approach. The vast majority of the behavioral clusters 
generated by our clustering system are very compact in 
terms of AV label graphs. This shows a strong agree- 
ment between our results and the malware family labels 
assigned to the malware samples by the AV scanners. 


Figure 5 also shows the distribution of the separation
between pairs of malware clusters. Ideally we would like all
the pairs of clusters to be perfectly separated (i.e., with a
separation index equal to 1). Figure 5 (bottom graph) shows
that most pairs of clusters are relatively well separated
from each other. For example, 90% of all the cluster pairs
from Feb09 have a separation index higher than 0.1. Both
cluster cohesion and separation provide a comparison with the
AV labels, and although our definition of cohesion and
separation indexes attenuates the effect of AV label
inconsistency (see Section 5), the results ultimately depend
on the quality of the AV labels themselves. For example, we
noticed that in most pairs of clusters with a low separation
the malware samples are labeled by the AV scanners as
belonging to generic malware families, such as “Generic”,
“Downloader”, or “Agent”.

Figure 6: Example of malware variants that generate the very same network traffic, but also
generate significantly different system events. (a) Variant 1; (b) Variant 2.


Overall, the distributions of the cohesion and separa- 
tion indexes in Figure 5 show that most of the obtained 
behavioral malware clusters are very compact and fairly 
well separated, in terms of AV malware family labels. By 
combining this automated analysis with the manual anal- 
ysis of those cases in which the separation index seemed 
to disagree with our clustering, we were able to confirm 
that our network-level clustering approach was indeed 
able to accurately cluster malware samples according to 
their network behavior. 


6.2 Network Signatures 


In this section, we discuss how our network-level be- 
havioral malware clustering can aid the automatic gener- 
ation of network signatures. The main idea is to period- 
ically extract signatures from newly collected malware 
samples, and to measure the effectiveness of such signa- 
tures for detecting the malicious HTTP traffic generated 
by current and future malware variants. 


Table 2 summarizes the results of the automatic signature
generation process. For each month’s worth of HTTP-based
malware, we considered only the malware clusters containing
at least 2 samples. We do not consider signatures from
singleton clusters because they are too specific and not
representative of a family of malware. We extracted a
signature set from each of the considered clusters as
explained in Section 4. For example (Table 2, row 1), for the
Feb09 malware dataset our clustering system found 235
clusters that contained at least 2 malware samples. The
cumulative number of distinct samples contained in the 235
clusters was 3,494, from which the automatic signature
generation process extracted a total of 544 signatures. After
signature pruning (explained below) the number of signatures
was reduced to 446.

[Table 2 columns: clusters (n > 1), samples, signatures, pruned sig.]
Table 2: Automatic signature generation and pruning results
(processing times for signature extraction are included in Table 1,
meta+sig column).

             Feb09    Mar09    Apr09    May09    Jun09    Jul09
Sig_Feb09    85.9%    50.4%    47.8%    27.0%
Sig_Mar09    -        64.2%    38.1%    25.6%
Sig_Apr09    -        -        63.1%    26.4%
Sig_May09    -        -        -        59.5%
Sig_Jun09    -        -        -        -
Sig_Jul09    -        -        -        -

Table 3: Signature detection rate on current and future malware
samples (1 month training)

To perform signature pruning (see Section 4), we proceeded as
follows. We collected a dataset of legitimate traffic by
sniffing the HTTP requests crossing the web proxy of a large,
well administered enterprise network with strict security
policies for about 2 days, between November 25 and November
27, 2008. The collected dataset of legitimate traffic
contained over $25.3 \cdot 10^6$ HTTP requests from 2,010
clients to thousands of different Web sites. We used existing
automatic techniques for detecting malicious HTTP traffic and
manual analysis to confirm that the collected HTTP traffic
was actually as clean as possible. We split this dataset in
two parts. We used the first day of traffic for signature
pruning, and the second day to estimate the false positive
rate of our pruned signatures (we will discuss our findings
regarding false positives later in this section). To prune
the 544 signatures extracted from Feb09, we translated the
signatures into a format compatible with Snort [6], and then
we used Snort’s detection engine to run our signatures over
the first day of legitimate traffic. We then pruned those
signatures that generated any alert, thus leaving us with 446
signatures. We repeated this pruning process for all the
signature sets we extracted from the other malware datasets.
In the following, we will refer to the pruned set of
signatures extracted from Feb09 as Sig_Feb09, and similarly
for the other months Sig_Mar09, Sig_Apr09, etc.
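
The pruning rule itself is simple and can be sketched as follows. This is only a minimal stand-in: the experiments above run the signatures through Snort, whereas the sketch uses Python regular expressions, and the signature patterns and legitimate requests shown are hypothetical.

```python
import re


def prune(signatures, legitimate_requests):
    """Keep only the signatures that raise no alert on legitimate traffic."""
    kept = []
    for pattern in signatures:
        rule = re.compile(pattern)
        if not any(rule.search(req) for req in legitimate_requests):
            kept.append(pattern)
    return kept


# Hypothetical signatures, written as regular expressions over the
# request line of an HTTP request.
signatures = [
    r"^POST /cgi-bin/generator",  # specific to one malware family
    r"^GET /index\.html",         # too generic: also matches clean traffic
]
first_day_legit = [
    "GET /index.html HTTP/1.1",
    "GET /news?id=7 HTTP/1.1",
]
pruned = prune(signatures, first_day_legit)
```

Here the second pattern is discarded because it fires on a legitimate request, while the first one survives.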


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation


Detection Rate We measured the ability of our signatures to
detect current and future malware samples. We measured the
detection rate of our automatically generated signatures as
follows. Given the signatures in the set Sig_Feb09, we
matched them (using Snort) over the HTTP traffic traces
generated by malware samples in Feb09, Mar09, Apr09, etc. We
repeated the same process by testing the signatures extracted
from a given month on the HTTP traffic generated by the
malware collected in that month and in future months. We
consider a malware sample to be detected if its HTTP traffic
causes at least one alert to be raised. The detection results
we obtained are summarized in Table 3. Take as an example the
first row. The signature set Sig_Feb09 “covers” (i.e., is
able to detect) 85.9% of the malware samples collected in
Feb09, 50.4% of the malware samples collected in Mar09, 47.8%
of the malware samples collected in Apr09, and so on.
Therefore, each of the signature sets we generated is able to
generalize to new, never-before-seen malware samples. This is
due to the fact that our network signatures aim to
“summarize” the behavior of a malware family, instead of
individual malware samples. As we discussed before, while
malware variants from the same family can be generated at a
high pace (e.g., using executable packing tools [16]), when
executed they will behave similarly, and therefore can be
detected by our behavioral network signatures. Naturally, as
malware behavior evolves, the detection rate of our network
signatures will decrease over time. Also, our approach is not
able to detect “unique” malware samples, which behave
differently from any of the malware groups our behavioral
clustering algorithm was able to identify. Nonetheless, it is
evident from Table 3 that if we periodically update our
signatures with a signature set automatically extracted from
the most recent malware samples, we can maintain a relatively
high detection rate on current and future malware samples.
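
The detection criterion used above (a sample counts as detected if at least one of its HTTP requests raises an alert) can be sketched as follows. Python regular expressions again stand in for Snort rules, and the sample identifiers, requests, and patterns are hypothetical.

```python
import re


def detection_rate(signatures, traces):
    """Fraction of samples with at least one request matching a signature.

    traces maps a sample identifier to the list of HTTP requests the
    sample generated during execution."""
    rules = [re.compile(pattern) for pattern in signatures]
    detected = sum(
        1
        for requests in traces.values()
        if any(rule.search(req) for rule in rules for req in requests)
    )
    return detected / len(traces)


signatures = [r"^POST /cgi-bin/generator", r"^POST /extra\.php"]
traces = {
    "sample1": ["POST /cgi-bin/generator HTTP/1.0"],
    "sample2": ["GET /a HTTP/1.1", "POST /extra.php HTTP/1.0"],
    "sample3": ["GET /index.html HTTP/1.1"],
    "sample4": ["GET /b HTTP/1.1"],
}
rate = detection_rate(signatures, traces)
```

Two of the four hypothetical samples trigger a signature, so the rate is 0.5. Applying one month's signature set to the traces of a later month gives the off-diagonal entries of a table such as Table 3.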


False Positives To measure the false positives generated by
our network signatures we proceeded as follows. For each of
the signature sets Sig_Feb09, Sig_Mar09, etc., we used Snort
to match them against the second day of legitimate HTTP
traffic collected as described at the beginning of this
Section. Table 4 summarizes the results we obtained. The
first row reports the false positive rate, measured as the
total number of alerts generated by a given signature set
divided by the number of HTTP requests in the legitimate
dataset. The numbers between parentheses represent the
absolute number of alerts raised. On the other hand, the
second row reports the fraction of distinct source IP
addresses that were deemed to be compromised, due to the fact
that some of their HTTP traffic matched any of our
signatures. The numbers between parentheses represent the
absolute number of the source IPs for which an alert was
raised. The results reported in Table 4 show that our
signatures generate a low false positive rate. Furthermore,
matching our signatures against one entire day of legitimate
traffic (about 12M HTTP queries from 2,010 distinct source
IPs) can be done in minutes. This means that we could “keep
up” with the traffic in real-time.

                   Sig_Feb09   Sig_Mar09       Sig_Apr09      Sig_May09      Sig_Jun09       Sig_Jul09
FP rate            0% (0)      3·10^-4% (38)   8·10^-6% (1)   5·10^-5% (6)   2·10^-4% (26)   10^-4% (18)
Distinct IPs       0% (0)      0.3% (6)        0.05% (1)      0.2% (4)       0.4% (9)        0.3% (7)
Processing Time    13 min      10 min          6 min          9 min          12 min          38 min

Table 4: False positives measured on one day of legitimate traffic (approximately 12M HTTP queries from 2,010 different source IPs).
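
As a quick arithmetic check of how the two rates in Table 4 relate to the absolute counts, the sketch below recomputes them for Sig_Mar09 (38 alerts, 6 flagged source IPs). The request total is rounded to 12 million, following the approximate figure given for one day of traffic, so the result is only an approximation; the names are hypothetical.

```python
def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


TOTAL_REQUESTS = 12_000_000  # approximate number of HTTP queries in one day
TOTAL_CLIENTS = 2_010        # distinct source IPs in the legitimate traffic

fp_rate = percent(38, TOTAL_REQUESTS)  # alerts raised by Sig_Mar09
ip_rate = percent(6, TOTAL_CLIENTS)    # source IPs flagged by Sig_Mar09
```

This gives a false positive rate of roughly 3·10^-4 % and a flagged-IP fraction of roughly 0.3%, in line with the Sig_Mar09 entries of Table 4.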


                   Apr09    May09    Jun09    Jul09
Sig_Feb09-Apr09    70.8%    35.6%    36.4%    35.1%
Sig_Mar09-May09    -        61.6%    48.6%    44.7%
Sig_Apr09-Jun09    -        -        62.7%    48.6%
Sig_May09-Jul09    -        -        -        68.6%

Table 5: Signature detection rate on current and future malware
samples (3 months training)





Other Detection Results Table 5 shows that if we combine
multiple signature sets, we can further increase the
detection rate on new malware samples. For example, by
combining the signatures extracted from the months of Apr09,
May09, and Jun09 (this signature set is referred to as
Sig_Apr09-Jun09 in Table 5), we can increase the “coverage”
of the Jun09 malware set from 58.9% (reported in Table 3) to
62.7%. Also, by testing the signature set Sig_Apr09-Jun09
against the malware traffic from Jul09 we obtained a
detection rate of 48.6%, which is significantly higher than
the 38.5% detection rate obtained using only the Sig_Jun09
signature set (see Table 3). In addition, matching our
largest set of signatures Sig_May09-Jul09, consisting of
2,973 distinct Snort rules, against one entire day of
legitimate traffic (about 12 million HTTP queries) took less
than one hour. This shows that our behavioral clustering and
subsequent signature generation approach, though not a silver
bullet, is a promising complement to other malware detection
techniques, such as AV scanners, and can play an important
role in a defense-in-depth strategy. This is also reflected
in the results reported in Table 6, which represent the
detection rate of our network signatures with respect to
malware samples that were not detected by any of the three AV
scanners available to us. For example, using the signature
set Sig_Feb09, we are able to detect 54.8% of the malware
collected in Feb09 that were not detected by the AV scanners,
52.8% of the undetected (by AVs) samples collected in Mar09,
29.4% of the undetected samples collected in Apr09, etc.


[Table 6 columns: Feb09, Mar09, Apr09, May09, Jun09, Jul09; rows: Sig_Feb09, Sig_Mar09, Sig_Apr09, Sig_May09, Sig_Jun09, Sig_Jul09.]
Table 6: Detection rate on malware undetected by all AVs.

We can see from Table 6 that, apart from the signatures
Sig_Apr09, all the other signature sets allow us to detect
between roughly 20% and 53% of future (i.e., collected in the
next month, compared to when the signatures were generated)
malware samples that AV scanners were not able to detect. We
believe the poor performance of Sig_Apr09 is due to the lower
number of distinct malware samples that we were able to
collect in Apr09. As a consequence, in that month we did not
have a large enough number of training samples from which to
learn good signatures.


6.2.1 Real-World Deployment Experience 


We had the opportunity to test our network signatures in a
large enterprise network consisting of several thousand
nodes that run a commercial host-based AV system. We
monitored this enterprise’s network traffic for a period of 4
days, from August 24 to August 28, 2009. We deployed our
Sig_Jun09 and Sig_Jul09 HTTP signatures (using Snort) to
monitor the traffic towards the enterprise’s web proxy.
Overall, our signature set consisted of 2,140 Snort rules. We
used the first 2 days of monitoring for signature pruning
(see Section 4), and the remaining 2 days to measure the
number of false positives of the pruned signature set. During
the pruning period, using a web interface to Snort’s logs, it
was fairly easy to verify that 32 of our rules were actually
causing false alerts. We then pruned (i.e., disabled) those
rules and kept monitoring the logs for the next 2 days. In
this 2-day testing period, overall the remaining signatures
generated only 12 false alerts. During our 4 days of
monitoring, we also found that 4 of our network signatures
detected actual malware behavior generated from 46 machines.
In particular, we found that 25 machines were generating HTTP
queries that matched a signature we extracted from two
variants of TR/Dldr.Agent.boey [Avira]. By analyzing the
payload of the HTTP requests we actually found that these
infected machines seemed to be exfiltrating (POSTing) data to
a notoriously spyware-related website. In addition, we found
19 machines that appeared to be infected by rogue AV
software, one bot-infected machine that contacted its
HTTP-based C&C server, and one machine that downloaded what
appeared to be an update of PWS-Banker.gen.dh.dldr [McAfee].


6.2.2 Comparison with other approaches. 


Table 7, third row, shows the next month detection 
rate (NMDR) for signatures generated by applying fine- 
grained clustering alone to each of our malware datasets. 
For example, given the malware dataset Feb09, we di- 
rectly applied fine-grained clustering to the related ma- 
licious traffic traces, instead of applying our three-step 
clustering process. Then, we extracted a set of signatures 
from each fine-grained cluster, and we tested the ob-
tained signature set on the HTTP traces generated by exe-
cuting the malware samples from the Mar09 dataset. We
repeated this process for all the other malware datasets
(notice that the NMDR for Jul09 is not defined in
Table 7, since we did not collect malware from August
2009). By comparing the results in Table 7 with Table 3, 
we can see that the signatures obtained by applying our 
three-step clustering process always yield a higher detec- 
tion rate, compared to signatures generated by applying 
the fine-grained clustering algorithm alone. 
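
The next month detection rate can be read as a simple ratio. The following is a minimal sketch, not the Snort-based pipeline used in our experiments: it assumes that each signature can be approximated by a regular expression over an HTTP request line, and that a sample counts as detected when any of its requests matches any signature. The function name and the toy data are hypothetical.

```python
import re


def next_month_detection_rate(signatures, traces_by_sample):
    """Fraction of next-month samples with at least one matching request.

    signatures: iterable of regex strings (hypothetical stand-in for Snort rules).
    traces_by_sample: dict mapping sample id -> list of HTTP request strings.
    """
    compiled = [re.compile(s) for s in signatures]
    if not traces_by_sample:
        return 0.0
    detected = sum(
        1
        for requests in traces_by_sample.values()
        if any(rx.search(req) for rx in compiled for req in requests)
    )
    return detected / len(traces_by_sample)


# Toy data: signatures learned in month m, traces collected in month m+1.
sigs = [r"GET /bot\.php\?id=\d+", r"POST /gate/.*\.cgi"]
next_month = {
    "s1": ["GET /bot.php?id=42 HTTP/1.1"],
    "s2": ["GET /index.html HTTP/1.1"],
    "s3": ["POST /gate/a.cgi HTTP/1.1", "GET / HTTP/1.1"],
    "s4": ["GET /favicon.ico HTTP/1.1"],
}
rate = next_month_detection_rate(sigs, next_month)
```

In this toy example two of the four samples match a signature, so the computed rate is one half.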


            Feb09    Mar09    Apr09    May09    Jun09    Jul09
Clusters    2,934    2,492    1,485    2,805    2,719    3,343
Time        4h4min   1h52min  Jhimin   3h9min   3h18min  2h57min
NMDR        38.4%    25.7%    24.2%    46.2%    36.3%    n/a

Table 7: Results obtained using only fine-grained clustering, in-
stead of the three-step clustering process (NMDR = next month
detection rate).





We also compared our approach to [10]. We ran- 
domly selected around four thousand samples from the 
Feb09 and May09 malware datasets. Precisely, we se- 
lected 2,038 samples from Feb09 and 1,978 samples 
from May09. We then shared these samples with the 
authors of [10], who kindly agreed to provide the clus- 
tering results produced by their system. The results they 
were able to share with us were in the form of a sim- 
ilarity matrix for each of the malware datasets we sent 
them. We then applied single linkage hierarchical clus- 
tering to each of these similarity matrices to obtain the 
related dendrogram. In order to find where to cut the 
obtained dendrogram, we used two different strategies. 
First, we applied the DB index [17] to automatically find 
the best dendrogram cut. However, the results we ob- 
tained were not satisfactory in this case, because the clus- 
ters were too “tight”. Only very few clusters contained 
more than one sample, thus yielding very specific sig- 
natures with a low detection rate. We then decided to 
select the threshold manually using our domain knowl- 
edge. This manual tuning process turned out to be very 
time-consuming, and therefore we finally decided to sim- 
ply use the similarity threshold value t = 0.7, which was 
also used in [10]. A manual analysis confirmed that with 
this threshold we obtained much better results, compared 
to using the DB index. 
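
Cutting a single-linkage dendrogram at a fixed similarity threshold is equivalent to taking the connected components of the graph that links two samples whenever their similarity reaches the threshold. The sketch below illustrates this equivalence with a small union-find structure. It is an illustration only: the similarity matrix is made up, and treating the cut as "similarity >= t" (rather than a strict inequality) is an assumption.

```python
def single_linkage_cut(sim, t):
    """Clusters obtained by cutting a single-linkage dendrogram at similarity t.

    sim: symmetric n x n list of lists with similarities in [0, 1].
    Two samples end up in the same cluster when a chain of pairs with
    similarity >= t connects them (connected components of the threshold graph).
    """
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= t:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical similarity matrix for four samples.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.75, 0.1],
    [0.2, 0.75, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]
clusters = single_linkage_cut(sim, 0.7)
tight = single_linkage_cut(sim, 0.95)
```

With t = 0.7 the first three samples are chained into one cluster, whereas the stricter cut at 0.95 leaves only singletons, which mirrors the overly "tight" clusters discussed above.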

We then extracted network signatures from the mal-
ware clusters obtained using our three-step network-
level clustering system, our fine-grained clustering only,
and [10] (in all cases, we used the HTTP traffic traces 
collected using our malware analysis system to extract 
and test the network signatures). All clustering ap- 
proaches were applied to the same reduced datasets de- 
scribed earlier. The results of our experiments are re- 
ported in Table 8. In the first row, “Sig_Feb09 net- 
clusters” indicates the dataset of signatures extracted 
from the (reduced) Feb09 dataset using our three-step
network-level clustering. “Sig_Feb09 net-fg-clusters” 
represents the set of signatures extracted using our fine- 
grained clustering only, while “Sig_Feb09 sys-clusters” 
indicates the signatures extracted from malware clus- 
ters obtained using a system-level clustering approach 
similar to [10]. We then tested the obtained signa- 
ture sets on the traffic traces of the entire malware 
datasets collected in Feb09 and Mar09. We repeated a 
similar process to obtain and test the “Sig_May09 net-
clusters”, “Sig_May09 net-fg-clusters”, and “Sig_May09
sys-clusters” using malware from the May09 dataset.
From Table 8 we can see that the signatures obtained us- 
ing our three-step clustering process yield a higher detec- 
tion rate in all cases, compared to using only fine-grained 
clustering, and to signatures obtained using a clustering 
approach similar to [10]. 


e090 May09— Jun09 


Sig_Feb09 net-clusters 78.6% 48.9% aa 


Sig_Feb09 net-fg-clusters 60.1% 25.1% 
56.0% 44.3% 


56.9% 33.9% 
pore S08 425% 

32.7% 32.0% 
Table 8: Malware detection rate for network signatures gener- 
ated using our three-step network-level clustering (net-clusters), 
only fine-grained network-level clustering (net-fg-clusters), and 
clusters generated using [10] (sys-clusters). 


Sig_Feb09 sys-clusters 
Sig_May09 net-clusters 
Sig_May09 net-fg-clusters 
Sig_May09 sys-clusters 





It is worth noting that while it may be possible to 
tune the similarity threshold t to improve the system-
level clusters, our network-level system can automati-
cally find the optimum dendrogram cut and yield accu-
rate network-level malware signatures.

We also applied Polygraph [24] to a subset of (only) 
49 malware samples from the Virut family. Polygraph 
ran for more than 2 entire weeks without completing. It 
is clear that Polygraph’s greedy clustering algorithm is 
not suitable for the problem at hand, and that without the 
preprocessing provided by our clustering system, gener-
ating network signatures to detect malware-related out-
bound HTTP traffic would be much more expensive.


6.2.3 


In this section, we analyze some of the reasons 
why system-level clustering may perform worse than 
network-level clustering, as shown in Table 8. 

In some cases, malware variants that generate the same 
malicious network traffic may generate significantly dif- 
ferent system-level events. Consider the example in Fig- 
ure 6, which reports information about the system and 
network events generated by two malware variants v1
and v2 (which are part of our Feb09 dataset). v1 is
labeled as Generic FakeAlert.h by McAfee and 
as TR/Dropper.Gen by Avira (Trend did not detect 
it), whereas v2 is labeled as DR/PCK.Tdss.A.21 by 
Avira (neither McAfee nor Trend detected this sample). 
When executed, the first sample runs in the background 
and does not display any message to the user. On the 
other hand, the second sample is a Trojan that presents
the user with a window pretending to be the installation 
software for an application called Aquaplay. However, 
regardless of whether the user chooses to complete the 
installation or not, the malware starts running and gen- 
erating HTTP traffic. The set of operations each variant 
performs on the system are significantly different (be- 
cause of space limitations, Figure 6 only shows filesys- 
tem events), and therefore these two samples would tend 
to be separated by system-level behavioral clustering. 
However, the HTTP traffic they generate is exactly the 
same. Both v1 and v2 send the same amount of data to an
IP address apparently located in Latvia, using the same 
two POST requests. It is clear that these two malware 
samples are related to each other, and our network-level 
clustering system correctly groups them together. We 
speculate that this is due to the fact that some malware 
authors try to spread their malicious code by infecting 
multiple different legitimate applications (e.g., different 
games) with the same bot code, for example, and then 
publishing the obtained trojans on the Internet (e.g., via 
peer-to-peer networks). When executed, each trojan may 
behave quite differently from a system point of view, 
since the original legitimate portions of their code are 
different. However, the malicious portions of their code 
will contact the same C&C. 

Another factor to take into account is that malware 
developers often reuse code written by others and cus- 
tomize it to fit their needs. For example, they may reuse 
the malicious code used to compromise a system (e.g., 
the rootkit installation code) and replace some of the 
malicious code modules that provide network connec- 
tivity to a C&C server (e.g., to replace an IRC-based 
C&C communication with code that allows the malware 
to contact the C&C using the HTTP protocol). In this 
case, while the system-level activities of different mal- 
ware may be very similar (because of a common sys- 
tem infection code base), their network traffic may look 
very different. In this case, grouping these malware in 
the same cluster may yield overly generic network signa- 
tures, which are prone to false positives and will likely be 
filtered out by the signature pruning process. Although 
it is difficult to measure how widespread such malware 
propagation strategies are, it is evident that system-level 
clustering may not always yield the desired results when 
the final objective is to extract network signatures. 


7 Limitations and Future Work 


Similarly to previous work that relies on executing 
malware samples to perform behavioral analysis [9, 10, 
20], our analysis is limited to malware samples that per-
form some “interesting actions” (i.e., malicious activi-
ties) during the execution time T. Unfortunately, these
interesting actions (both at the system and network level) 


may be triggered by events [11] such as a particular date, 
the way the user interacts with the infected machine, etc. 
In such cases, techniques similar to the ones proposed 
in [11] may be used to identify and activate such triggers. 
Trigger-based malware analysis is outside the scope of 
this paper, and is therefore left to future work. 

Because we perform an analysis of the content of 
HTTP requests and responses, encryption represents our 
main limitation. Some malware writers may decide to 
use the HTTPS protocol, instead of HTTP. However, it 
is worth noting that using HTTPS may play against the 
malware itself, since many networks (in particular enter- 
prise networks) may decide to allow only HTTPS traffic 
to/from certified servers. While some legitimate websites 
operate using self-signed public keys (e.g., to avoid CA 
signing costs), these cases can be handled by progres- 
sively building a whitelist of authorized self-signed pub- 
lic keys. However, we acknowledge this approach may 
be hard to implement in networks (e.g., ISP networks) 
where strict security policies may not be enforced. 


Our signature pruning process (see Section 4) relies 
on testing malware signatures against a large dataset 
of legitimate traffic. However, collecting a completely 
clean traffic dataset may be difficult in practice. In turn, 
performing signature pruning using a non-clean traffic 
dataset may cause some malware signatures to be erro- 
neously filtered out, thus decreasing our detection rate. 
There are a number of practical steps we can follow to 
mitigate this problem. First, since we are mostly inter- 
ested in detecting new malware behavior, we can apply 
our signature pruning process over a dataset of traffic col- 
lected a few months before. The assumption is that this 
“old” traffic will not contain traces of future malware be- 
havior, and therefore the related malware signatures ex- 
tracted by our system will not be filtered out. On the 
other hand, we expect the majority of legitimate HTTP 
traffic to be fairly “stable”, since the most popular Web 
sites and applications do not change very rapidly. An- 
other approach we can use is to collect traffic from many 
different networks, and only filter out those signatures 
that generate false positives in the majority of these net- 
works. The assumption here is that the same new mal- 
ware behavior may not be present in the majority of the 
selected networks at the same time. 
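
The multi-network pruning idea can be sketched as a majority vote. The code below is a hedged illustration rather than our implementation: the signature identifiers, the network names, and the use of a strict majority (more than half of the networks) are assumptions made for the example.

```python
def prune_by_majority(signatures, fp_by_network):
    """Keep signatures unless they raise alerts in most monitored networks.

    signatures: iterable of signature ids.
    fp_by_network: dict mapping network name -> set of signature ids that
        fired on that network's (presumed mostly legitimate) traffic.
    """
    n_networks = len(fp_by_network)
    kept = []
    for sig in signatures:
        hits = sum(1 for fired in fp_by_network.values() if sig in fired)
        # Strict majority: prune only if more than half of the networks flag it.
        if hits * 2 > n_networks:
            continue
        kept.append(sig)
    return kept


# Hypothetical alerts observed in three different networks.
fired = {
    "net_a": {"sig1", "sig2"},
    "net_b": {"sig1"},
    "net_c": {"sig1", "sig3"},
}
kept = prune_by_majority(["sig1", "sig2", "sig3", "sig4"], fired)
```

Here sig1 fires everywhere and is treated as a false-positive generator, while sig2 and sig3 fire in a single network each and are retained, since they may correspond to new malware behavior present in only that network.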


Evasion attacks, such as noise-injection attacks [28] 
and other similar attacks [25], may affect the results of 
our clustering system and network signatures. Because 
we run the malware in a protected environment, it may 
be possible to identify what HTTP requests are actually 
performed to send or receive information critical for the 
correct functioning of the malware using dynamic taint 
analysis [13]. This may allow us to correlate network 
traffic with system activities performed by the malware, 
and to identify whether the malware is injecting ran-
domly generated/selected elements into the network traf-
fic. However, taint analysis may be evaded [12] and mis- 
led using sophisticated noise-injection attacks. System- 
level malware clustering (such as [9, 10]) and signature 
generation algorithms may also be affected by such at- 
tacks, e.g., by creating “noisy” system events that do not 
serve real malicious purposes, but simply try to mislead 
the clustering process and the generation of a good de- 
tection model. Noise injection attacks are a challenging 
research problem to be addressed in future work. 


8 Conclusion


In this paper, we presented a network-level behavioral 
malware clustering system that focuses on HTTP-based 
malware and clusters malware samples based on a notion 
of structural similarity between the malicious HTTP traf- 
fic they generate. Through network-level analysis, our 
behavioral clustering system is able to unveil similarities 
among malware samples that may not be captured by cur- 
rent system-level behavioral clustering systems. Also, 
we proposed a new method for the analysis of malware 
clustering results. The output of our clustering system 
can be readily used as input for algorithms that automat- 
ically generate network signatures. Our experimental re- 
sults on over 25,000 malware samples confirm the ef- 
fectiveness of the proposed clustering system, and show 
that it can aid the process of automatically extracting net- 
work signatures for detecting HTTP traffic generated by 
malware-compromised machines. 
Abstract


Holding residential ISPs to their contractual or legal 
obligations of “unlimited service” or “network neutral- 
ity” is hard because their traffic management policies are 
opaque to end users and governmental regulatory agen-
cies. We have built and deployed Glasnost, a system
that improves network transparency by enabling ordi- 
nary Internet users to detect whether their ISPs are dif- 
ferentiating between flows of specific applications. We 
identify three key challenges in designing such a sys- 
tem: (a) to attract many users, the system must have 
a low barrier of use and generate results in a timely man-
ner, (b) the results must be robust to measurement noise
and avoid false accusations of differentiation, which can 
adversely affect ISPs’ reputation and business, (c) the 
system must include mechanisms to keep it up-to-date 
with the continuously changing differentiation policies 
of ISPs worldwide. We describe how Glasnost addresses 
each of these challenges. Glasnost has been operational 
for over a year. More than 350,000 users from over 
5,800 ISPs worldwide have used Glasnost to detect dif- 
ferentiation, validating many of our design choices. We 
show how data from individual Glasnost users can be 
aggregated to provide regulators and monitors with use- 
ful information on ISP-wide deployment of various dif- 
ferentiation policies. 


1 Introduction 


A confluence of technical, business, and political in- 
terests has made “network neutrality” a hot button is- 
sue [18, 19]. The debate revolves around whether and to 
what extent Internet service providers (ISPs), who own 
and operate data networks, should be allowed to differ- 
entiate one class of traffic from another. Many ISPs want 
to restrict bandwidth-hungry applications that can hurt 
other applications in the network. Some also want to 
control applications such as VoIP that reduce ISPs’ abil- 
ity to profit from competing services of their own. In 
contrast, many content providers are against traffic dif-
ferentiation because it gives the ISPs arbitrary control
over the quality of service experienced by users. In par-
allel, regulatory bodies and politicians are trying to de- 
vise policies that balance competing concerns [20, 21]. 

As this debate rages, ordinary Internet users are often 
in the dark, even though they are directly affected. The 
information sources available to users today are media 
reports, blogs, and statements made by ISPs; such in- 
formation sources are imprecise at best and incorrect at 
worst. As a result, much traffic differentiation occurs 
without their knowledge. However, when ISPs’ traffic
management practices come to light, user outrage forces 
regulatory bodies to conduct public hearings on preva- 
lent practices [20, 21]. 

This situation led us to build and deploy a system, 
called Glasnost, that enables users to detect if they are 
subject to traffic differentiation. We make no judgment 
about whether traffic differentiation should be permit- 
ted by regulatory policy. Rather, our motivation is to 
make any differentiation along their paths transparent 
to users. 

While other recent research efforts also aim to detect 
traffic differentiation [27, 31], Glasnost is unique in its 
focus on users. Instead of providing only a broad char- 
acterization of differentiation in the Internet, our goal is 
to let individual users determine if they experience dif- 
ferentiation and quantify its impact at the time they use 
our system. 

Our focus is on enabling individuals who are not 
technically savvy. This creates design constraints that 
are typically not present in other measurement systems. 
First, the bar to using the system must be low. For 
instance, it is undesirable to require the installation of 
special software on client machines, especially if such 
software needs privileged access. This constraint hin- 
ders our ability to collect high-fidelity data (e.g., packet 
traces) or to finely control packet transmissions. We 
must limit ourselves to coarse-grained data obtained 
through unprivileged client operations. Second, the re- 
sults for an individual user must be accurate and simple 
to interpret. For example, we cannot return results that 
rely on inferences derived from data aggregated across
users. While such results are accurate in aggregate, they
could be incorrect when applied to an individual user. 
Third, the system must evolve with ISP practices. Oth- 
erwise, Glasnost would gradually become unable to de- 
tect the presence of differentiation and users would stop 
trusting the system. 


We based our Glasnost design on these constraints. 
The result is a system that is effective and easy to use. 
A user can detect differentiation by simply pointing her 
browser to a Web page. The browser downloads and 
runs a Java applet which exchanges traffic with our mea- 
surement server. The client-server nature of our archi- 
tecture helps to avoid many of the operational issues 
with network measurements, such as traversing NATs
and firewalls, or raising alarms in network intrusion de- 
tection systems. The traffic exchange is designed to ac- 
curately and quickly detect any differentiation. We also 
build a simple flow emulation tool that simplifies the in- 
corporation of tests to detect new differentiation tech- 
niques that emerge in the Internet. 


The diversity of ISP practices makes it challenging 
to detect traffic differentiation reliably. For instance, an 
ISP might employ differentiation only at specific times 
(e.g., in the evenings), or only under high loads, or only 
for flows that send too much traffic. These factors led 
us to design an on-demand system. Each time a user 
uses Glasnost, she performs an individual test that de- 
tects the presence of traffic differentiation for her Inter- 
net connection at the time of the test. This provides a 
more reliable answer to this user than extrapolating the 
results from other testing times or other users. 


Glasnost has been operational since March 2008, en- 
abling users to detect BitTorrent differentiation. Be- 
tween March 2008 and September 2009, more than 
350,000 users from over 5,800 ISPs worldwide have 
used the system. Several individuals and corporations 
volunteered to host Glasnost measurement servers on 
their own infrastructure in order to allow operations on 
an even larger scale. We believe that our design princi- 
ples have directly contributed to the success of Glasnost. 


In addition to the design and evaluation of Glasnost, 
we also present a detailed analysis of BitTorrent differ- 
entiation in the Internet. We find that about 10% of our 
users experience differentiation of BitTorrent traffic. We 
also study ISPs’ BitTorrent differentiation policies in de- 
tail over a period of two months (from January to Febru- 
ary 2009) using data from the Glasnost tests. We find, 
for instance, that it is more common for ISPs to differ- 
entiate against file uploads than downloads and to differ- 
entiate throughout the day rather than only during peak 
hours. 
2 Traffic Differentiation


Traffic differentiation refers to an ISP treating the pack- 
ets of one flow differently than those of another flow. 
Based on information published by ISPs, researchers, 
and equipment vendors [5, 10, 22], we characterize traf- 
fic differentiation along three dimensions. 


1. Traffic differentiation based on flow types. To dif-
ferentiate between flows of different types, i.e., belong-
ing to different applications, ISPs must distinguish the
packets of one flow from those of other flows. This can 
be done by examining one of the following: 


(a) The IP header. The source or destination addresses 
can determine how an ISP treats a flow. For ex- 
ample, universities routinely rate-limit only traffic 
that’s going to or coming from their student dorms. 


(b) The transport protocol header. ISPs can use port 
numbers or other transport protocol identifiers to 
determine a flow’s treatment. For example, P2P 
traffic is sometimes identified based on its port 
numbers. 


(c) The packet payload. ISPs can use deep-packet in-
spection (DPI) to identify the application generat-
ing a packet. For example, ISPs look for P2P pro-
tocol messages in packet payload to rate-limit the
traffic of P2P applications, such as BitTorrent.


2. Traffic differentiation independent of flow type. In 
addition to features of a flow itself, an ISP may use other 
criteria to determine whether to differentiate. Some of 
these include: 


(a) Time of day. An ISP may differentiate only during 
peak hours. 


(b) Network load. An ISP may differentiate on a link 
only when the network load on that link is high. 


(c) User behavior. An ISP may differentiate only
against users with heavy bandwidth usage. 


3. Traffic manipulation mechanisms. There are a 
number of ways in which an ISP can treat one class of 
packets differently. 


(a) Blocking. One form of differentiation is to termi- 
nate a flow, either by blocking its packets or by 
injecting a connection termination message (e.g., 
sending a TCP FIN or TCP RST packet). 


(b) Deprioritizing. Routers can use multiple priority 
queues when forwarding packets. ISPs can use this 
mechanism to assign differentiated flows to lower 
priority queues and to limit the throughput of cer- 
tain classes. 
(c) Packet dropping. Packets of a flow can be dropped
using either a fixed or a variable drop rate.


(d) Modifying TCP advertised window size. ISPs can 
lower the advertised window size of a TCP flow, 
prompting a sender to slow down. 


(e) Application-level mechanisms. ISPs can control an 
application’s behavior by modifying its protocol 
messages. For example, transparent proxies [28] 
can redirect HTTP or P2P flows to alternate con- 
tent servers. 


What kinds of traffic differentiation does Glasnost 
detect? 

Our current implementation of Glasnost detects traf- 
fic differentiation that is triggered by transport protocol 
headers (e.g., ports) or packet payload. These triggers 
are more common than IP headers [1, 5]. 

We designed Glasnost to be an on-demand system. 
Each time a user uses Glasnost, we detect traffic differ- 
entiation between flows of the user at the time of the 
test. While Glasnost has not been designed to detect traf-
fic shaping that affects all flows of a user, e.g., based
on time of day or network load or user behavior, it is 
possible to infer such shaping policies by aggregating 
and comparing the results of Glasnost tests conducted at 
different times of the day by different users on different 
networks. 
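
As a rough illustration of such aggregation, the sketch below groups per-test outcomes by hour of day and computes the share of tests that detected differentiation in each hour. The record format (an hour in 0-23 and a boolean outcome) is hypothetical and chosen only for this example; it is not the format in which Glasnost stores results.

```python
from collections import defaultdict


def differentiation_rate_by_hour(tests):
    """Share of tests that detected differentiation, grouped by hour of day.

    tests: iterable of (hour, detected) pairs, where hour is 0-23 and
    detected is a bool. Both fields are hypothetical.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for hour, detected in tests:
        totals[hour] += 1
        if detected:
            positives[hour] += 1
    return {h: positives[h] / totals[h] for h in sorted(totals)}


# Made-up outcomes from several users of the same ISP.
tests = [(3, False), (3, False), (20, True), (20, True), (20, False), (21, True)]
rates = differentiation_rate_by_hour(tests)
```

A markedly higher rate in the evening hours than at night would hint at a time-of-day or load-dependent policy, although a single user's test cannot reveal this on its own.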

Instead of inferring differentiation based on a particu- 
lar manipulation mechanism, Glasnost detects the pres- 
ence of differentiation based on its impact on application 
performance. 


3 Design Principles 


In the process of developing Glasnost we identified sev- 
eral key design principles. Although in Glasnost our 
focus is traffic differentiation, the design principles we 
identified are more general and apply to many mea- 
surement systems that want to attract a large number of 
users. In this section, we discuss these principles in de- 
tail and argue why they are generally useful when de- 
signing measurement systems for Internet users at large. 

Our goal was to build a system that lets ordinary In- 
ternet users determine if they are affected by traffic dif- 
ferentiation. Because of its focus on end users and the 
nature of its measurements, Glasnost must satisfy cer- 
tain design requirements that are typically not present 
in other measurement systems. We distill these require- 
ments into three design principles. These principles dic- 
tate that the system must be easy to use so that it can 
serve any Internet user, its inferences must be robust and 
simple to interpret, and it must be extensible to allow de- 
tection of new network policies as they evolve. 


We explain these principles in detail below and also 
describe the consequences they have on the design of 
Glasnost. These consequences motivate certain design 
choices and rule out many others. 


Principle #1: Low barrier of use 

Attracting a large number of users to a measurement 
system requires having a low barrier of use. Although 
this challenge appears obvious, solving it is the key to 
success. As we discuss later, it complicated the design 
of other aspects of the system. But at each step we re- 
sisted the temptation to compromise in the interest of 
other desirables such as efficiency and higher-fidelity 
data. 

Design consequences. There are four design conse- 
quences of this principle. First, because most users are 
not technically savvy, the interface must be simple and 
intuitive. Second, we cannot require users to install new 
software or perform administrative tasks. Many network 
measurement techniques require installing drivers (e.g., 
the WinPcap library for Windows) or running privileged 
code (e.g., raw sockets) on users’ machines. Such code 
can provide detailed, low-level data (e.g., packet traces) 
that simplifies the measurement task. But in our experi- 
ence, users are often unwilling to use systems with such 
requirements. For example, one of our earlier attempts 
required users to run code with administrator privileges 
on their machines and to leave a port open in their fire- 
walls and NATs. These obstacles greatly limited adop- 
tion; we attracted fewer than fifty users. Third, because 
many users have little patience, the system must com- 
plete its measurements quickly. Fourth, to incentivize 
users to use the system in the first place, the system 
should display per-user results immediately after com- 
pleting the measurements. 

In order to satisfy the above requirements, our current
client-side implementation uses a small-size Java applet 
(21 KBytes) that users download off our webpage. The 
applet exchanges traffic with our servers, which we then 
analyze to detect differentiation (we explain the nature 
of this traffic below). The test runs for about 6 minutes. 
Immediately after the test is finished, Glasnost reports whether
the user is affected by traffic differentiation.

Our quick and simple test methodology is inspired 
by non-research-oriented web sites for broadband speed 
tests [2] and represents a departure from other research 
systems. For instance, Scriptroute [25] requires users 
to write their own measurement scripts, and thus its use 
has been limited to researchers and other experts. 


Principle #2: Measurement accountability 

Because the system is designed for ordinary users, it is 
essential that the measurements are accurate and that the 
results cannot be misinterpreted. For instance, consider 
the results of an experiment to infer path capacities in 
the Internet. Since the measurements can be affected by
transient noise, researchers will know that the answer 
computed along an individual path cannot be trusted but 
the answers can be aggregated to provide an accurate 
estimate of path capacity. But an ordinary user that is 
interested in the capacity of her own path might not be 
in a position to make that distinction. 


When detecting traffic differentiation, accurate inter- 
pretation of results is critical due to the controversial na- 
ture of traffic management in the Internet: there is still 
a heated debate whether it is legal for an ISP to em- 
ploy traffic management. In addition, if people were 
to falsely interpret results as their ISP performing traf- 
fic differentiation when in fact it is not, the system 
would quickly lose credibility. In fact, in the past there 
have been instances when some widely publicized stud- 
ies have mistakenly accused ISPs of using policies they 
never deployed [26, 29]. 


Design consequences. Maintaining measurement ac- 
countability has three design consequences. First, the 
test to detect differentiation should, to the extent pos- 
sible, marginalize any factors that add uncertainty. The 
performance of an Internet flow can be affected by many 
confounding factors. This includes the operating sys- 
tem, especially its networking stack and its configu- 
ration. Additionally, directly using application client 
software is problematic as it does not give full control 
over the measurement traffic. Short-term throughput of 
such “natural” flows can differ because of differences in 
packet sizes and burstiness. Finally, we have to consider 
transient noise, such as that caused by background traffic.


With passive measurement tools, it is often not easy 
to isolate these factors. These tools must account
for a large number of confounding factors in their
inference. The complexity of this analysis can lead to 
inaccurate results. In contrast, active measurements can 
be designed to avoid most confounding factors. Having 
full control over the traffic that is sent to measure per-
formance simplifies the analysis. Further, active mea-
surements allow us to run all measurements between the
same pair of hosts, removing factors like OS and net- 
working stack. The only remaining confounding factor 
is transient noise, which can be dealt with using sim- 
ple techniques such as repeating measurements multiple 
times. 


Second, because not all uncertainty can be removed 
from the inference, the result presented to the user must 
be conservative, with a near-zero false positive rate. 
In the context of traffic differentiation, a false positive 
means that the system falsely claims that the user is ex-
periencing traffic differentiation. Minimizing false pos-
itives is challenging because it results in an increase in
the false negative rate. This trade-off is inherent. 
Because of the concerns above, our testing primitive
is based on comparing the throughput of a pair of flows. 
One flow in the pair belongs to the potential victim ap- 
plication. The second is a reference flow that belongs 
to a different application. The flows are identical except 
for the trigger that we want to test for differentiation, 
such as port number or payload. The flows are gener- 
ated back-to-back and multiple pairs are run to reduce 
and calibrate the effect of noise. 


Third, we must be prepared to provide the data and 
the evidence behind our inferences when requested. We 
retain the data of all measurements in which Glasnost 
detects traffic differentiation. We treat this data as evi- 
dence. If we are challenged to justify our findings, the 
stored data will help us explain on what basis Glasnost 
declared that an ISP is using traffic differentiation. 


Principle #3: Easy to evolve 


To remain relevant, a system that wants to detect traf- 
fic differentiation must be able to evolve as ISPs evolve 
their traffic management policies. For example, in Fall 
2008, Comcast blocked BitTorrent uploads for some of 
its customers [10]. Several months later, they started 
replacing this practice with less severe forms of differ- 
entiation [7]. In fact, our recent measurements indicate 
that BitTorrent traffic blocking is rare today unlike in 
2008. A system with a fixed set of capabilities will have 
a limited shelf life in such an evolving environment. 


Design consequences. This principle mandates in- 
corporation of mechanisms that help the system evolve 
with the network. Network evolution may be inciden- 
tal or adversarial. In an incidental evolution, ISPs might 
target new applications in the future or use new traffic 
manipulation mechanisms. A detection system should 
be extensible, to add tests that detect traffic differentia- 
tion against popular new applications or based on new 
shaping techniques. Glasnost enables advanced users to 
submit packet-level traces of applications that they sus- 
pect are being targeted by their ISPs. User suspicion is 
powerful; it was how many of the currently known ISP 
differentiation behaviors came to light. We do not ex- 
pect all users to be able to submit traces but there are 
many enthusiastic users that are capable of collecting 
(with our help if needed) and sharing traces. Glasnost 
then makes it easy to use these network traces to con- 
struct new detection tests. These tests help us keep pace 
with new traffic differentiation techniques and applica- 
tions that may be targeted. 


Figure 1: The Glasnost system. (1) The client contacts
the Glasnost webpage. (2) The webpage returns the address
of a measurement server. (3) The client connects
to the measurement server and loads a Java applet. The
applet then starts to emulate a sequence of flows. (4)
After the test is done, the collected data is analyzed and
a results page is displayed to the client.


Adversarially, ISPs could begin whitelisting traffic
from measurement servers in an attempt to evade detection.
A successful system must be aware of this problem
and find ways to minimize whitelisting. Our solution
was to make our server code publicly available. Anyone
can set up Glasnost on a well-provisioned server and
other users can start measuring against the new servers. Making
our code publicly available allowed other Glasnost
servers to appear on the Internet, which makes it hard
for ISPs to evade detection. However, this method
is not foolproof; a determined ISP may choose to stay 
up-to-date with the list of Glasnost servers. We doubt 
that many ISPs would be willing to invest significant 
effort in evading detection. As much as an ISP would 
like to conceal its traffic management practices from 
the public, denying those practices or making blatant at- 
tempts to hide them is risky. Such behavior, if detected, 
would attract intense scrutiny from telecom regulators 
and would severely damage the ISP’s reputation. For 
example, when Comcast’s BitTorrent blocking practices 
were revealed to the public [1], Comcast was fined by 
the FCC and was subjected to highly critical media cov- 
erage. 


4 Design of Glasnost 


We now present the design of Glasnost based on the re- 
quirements outlined above. 


4.1 System architecture 


Glasnost is based on a client-server architecture. Clients 
connect to a Glasnost server to download and run var- 
ious tests. Each test measures the path between the 
client and the server by generating flows that carry 
application-level data. This data is carefully constructed 
to detect traffic differentiation along the path. 

Figure 1 presents a high-level description of how
clients measure their Internet paths. A client first contacts
a central webpage that redirects to a Glasnost measurement
server. This dynamic redirection enables load
balancing across measurement servers and makes it easy
to incorporate new servers by adding them to the redirection
list.

Figure 2: The Glasnost web interface.

After the client is redirected, the measurement server 
presents a simple interface to the user. As shown in Fig- 
ure 2, the user selects the application traffic she would 
like to test and starts the test by just clicking the “Start 
testing” button. The client’s browser downloads a Java 
applet that starts exchanging packets with the server. We 
elaborate on the Glasnost measurement tests next. 


4.2 Measurement tests 


The key primitive behind the Glasnost measurement 
tests is the emulation of a pair of flows that are iden- 
tical except in one respect that we suspect triggers dif- 
ferentiation along the path. Comparing the performance 
of these flows helps to determine if differentiation is in- 
deed present. 

Figure 3 shows two flows designed to detect whether 
differentiation based on BitTorrent protocol content is 
present along a path. The exchange on the left corre- 
sponds to the first flow. The client opens a TCP con- 
nection to the measurement server and starts exchang- 
ing packets that implement the BitTorrent protocol: the 
packet payloads carry BitTorrent protocol headers and 
content. The exchange on the right corresponds to the 
second flow. The client opens another TCP connection 
and performs the same packet exchange, but the packets 
contain random bytes instead of BitTorrent headers or 
data. An ISP that differentiates against BitTorrent based 
on protocol messages would impact only the first flow. 
Thus, a significant difference in the flows’ performance
is likely to be caused by the difference in their payloads
and lets us detect whether differentiation is present
along the path. Transient noise can also lead to differ- 
ences in flows’ performance; we describe in the next 
section how we handle noise. 
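
The construction of the reference flow can be illustrated with a minimal sketch. This is not the Glasnost implementation: the function name `make_reference_flow`, the `(sender, payload)` representation, and the placeholder message contents are assumptions made for illustration. Only the idea comes from the text: same packet exchange, random bytes instead of BitTorrent headers or data.

```python
import os

def make_reference_flow(flow):
    """Replace every payload with random bytes of the same length.

    Sender and message order are kept, so the two flows differ only in
    their payload bytes.
    """
    return [(sender, os.urandom(len(payload))) for sender, payload in flow]

# Placeholder BitTorrent-style messages (hypothetical contents, realistic sizes).
bittorrent_flow = [
    ("client", b"\x13BitTorrent protocol" + bytes(48)),  # 68-byte handshake
    ("server", b"\x13BitTorrent protocol" + bytes(48)),  # 68-byte handshake
    ("client", bytes([0, 0, 0, 1, 2])),                  # 5-byte "interested"
    ("server", bytes([0, 0, 0, 1, 1])),                  # 5-byte "unchoke"
]

reference_flow = make_reference_flow(bittorrent_flow)
```

Because sizes, senders, and ordering are preserved, any systematic throughput gap between the two flows points at the payload as the trigger.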


NSDI ’10: 7th USENIX Symposium on Networked Systems Design and Implementation




[Figure 3 diagram: (a) BitTorrent flow, in which client and server exchange Handshake [68B], Bitfield, Interested [5B], Unchoke [5B], Request [17B], and Piece messages; (b) Reference flow, in which the same exchange is performed with Random payloads of the same sizes.]

Figure 3: A pair of flows used in Glasnost tests. The
two flows are identical in all aspects other than their
packet payloads, which allows us to detect differentiation
that targets flows based on their packet contents.


During the test, the measurement server records a 
packet-level trace of all emulated flows and the client 
applet records ancillary information including excep- 
tions caused by network errors. Once the transfers end, 
the client uploads the recorded information to the server. 
The server analyzes this information together with the 
traces collected on the server-side and shows the find- 
ings to the client. 

Glasnost’s emulation methodology leads to measure- 
ment robustness. As Figure 3 shows, application-level 
data is the only difference between the two emulated 
flows. The two flows traverse the same network path and 
have the same network-level characteristics, such as port 
numbers, packet sizes, etc. In contrast, with passive
measurement many factors may differ across the
measured flows. Correctly accounting
for all such differences is challenging.

Another benefit of active measurement is the ability 
to carefully control the measurement test. For example, 
we can repeat flows with different payloads or port num- 
bers. This ability allows Glasnost to precisely identify 
the specific factors that trigger differentiation. 

In the next section, we describe our measurement test 
in more detail and how we make it robust to transient 
noise. We describe how we make the system easy to 
evolve using a trace replay based tool for constructing 
measurement tests in Section 6. 


5 Robust Detection of Differentiation 


As described earlier, Glasnost emulates a pair of flows 
and determines the presence of traffic differentiation by 
comparing their performance. When comparing the per- 
formance of a pair of flows, we must ensure that their 
difference is indeed due to the differences in their content
and not due to some changes in the test environment.
Our measurement tests are constructed in a way
that eliminates all major confounding factors except one 
— transient noise due to interference from cross-traffic 
(background traffic) along the measurement path. In this 
section, we discuss techniques to robustly detect traffic 
differentiation in the face of transient noise. 

The primary challenge in this task stems from the 
fact that the noise can vary at small time-scales. Thus, 
two flows can be affected differently even if run back- 
to-back. As one egregious example, we found that the 
throughput of two back-to-back flows differed by a fac- 
tor of three even though the flows were identical. A sim- 
plistic detection method will mistakenly detect differen- 
tiation in this case. It might appear that the differential 
impact of noise could be reduced by running the flows 
simultaneously. But we find that setup to be even worse 
because of self-interference between the two flows.

Our basic strategy for robust detection is to run each 
flow type multiple times. We use the variance in the 
performance of the flows of the same type to identify 
paths that are too noisy to enable reliable detection. For 
the remaining paths, we can then detect differentiation 
by comparing the flows of different types. We first de- 
scribe how we apply this strategy when tests are run long 
enough that we do not have to worry about having too 
little data. As we found that many users are too impa- 
tient to run long tests, we adapted our strategy to tests 
that run for a shorter duration. 

We describe our method using throughput as the measure
of flow performance*, since it is of prime interest to
many applications and is the target of many ISPs looking 
to reduce their network load. Because of TCP dynam- 
ics, throughput is directly affected by any differentiation 
that impacts flow latency or loss. 


5.1 Filtering tests affected by noise 


To detect the level of transient noise, we repeat the runs 
of the two flow types multiple times back-to-back. Un- 
like active ISP differentiation, transient noise does not 
discriminate based on flow content; it would also affect
multiple runs of the same flow type and thus can be detected
by comparing their performance.

To understand transient noise patterns and the extent 
to which they affect flow throughput, we configured our 
Glasnost deployment to run a BitTorrent flow and a ref- 
erence flow with random bytes, five times each. The 
runs of the two flow types were interspersed and each 
flow lasted for 60 seconds to allow sufficient time for 
TCP to achieve stable throughput. Over a period of one
month, we collected measurements of 3,705 residential
broadband hosts, 2,871 in the upstream and 834 in the
downstream direction.


* Our method can be extended to other measures of performance such
as jitter.


Figure 4: The four classes of noise we observed in
our analysis. The graph shows the minimum, median,
and maximum throughputs observed in example tests affected
by each class of noise.

We compared the throughput obtained by the five runs 
of each flow type with each other. Our analysis of the 
maximum, median, and minimum throughput reveals 
the four distinct patterns shown in Figure 4, correspond- 
ing to four different cross-traffic levels: 


1. Consistently low cross-traffic: all throughput 
measurements belonging to the same flow type fall 
within a narrow range (i.e., min is close to max).


2. Mostly low but occasionally high cross-traffic: a 
majority of throughput measurements are clustered 
around the maximum but a few points are farther 
away (i.e., max and min are far apart but median is 
close to max). 


3. Highly variable cross-traffic: the throughput 
measurements are scattered over a wide range (i.e., 
max and min are far apart and median is far apart 
from both). 


4. Mostly high but occasionally low cross-traffic: a 
majority of throughput measurements are clustered 
around the minimum but a few measurements are 
farther away (i.e., max and min are far apart but 
median is close to min). 


Our categorization of the level of cross-traffic in each 
case is based on two key observations about the nature 
and impact of cross-traffic. First, cross-traffic only low- 
ers throughput and never improves it. Thus, when a 
majority of throughput measurements are close to min 
but far apart from max (as in category 4 above), it is 
more likely that the noise-free throughput is closer to 
max than min. 

Second, cross-traffic is unlikely to be consistently
high over a long period of time. In theory, measurements
in category 1 above could be explained by consistently
high cross-traffic. But this would require the cross-traffic
to remain high and consistent (without changing)
over the duration of the entire experiment, which is ten
minutes. We believe that this is unlikely.


Figure 5: Noise observed in our 3,705 sample dataset.
85.2% of upstream flows and 75.7% of downstream
flows have less than 20% of noise. Noise is measured as
the difference between maximum and median throughput
calculated as a percentage of maximum throughput.
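
The four categories above can be made concrete with a small sketch. The text describes the categories only qualitatively ("close to", "far apart"), so the tolerance `tol` and the function name `noise_class` are assumptions introduced here for illustration; they are not taken from Glasnost.

```python
from statistics import median

def noise_class(throughputs, tol=0.20):
    """Return the cross-traffic class (1-4) for the runs of one flow type.

    `tol` is a hypothetical tolerance for "close", expressed as a
    fraction of the maximum throughput.
    """
    hi, lo, med = max(throughputs), min(throughputs), median(throughputs)
    if hi <= 0:
        raise ValueError("need at least one positive throughput")
    if (hi - lo) / hi <= tol:
        return 1  # consistently low cross-traffic: min close to max
    if (hi - med) / hi <= tol:
        return 2  # mostly low, occasionally high: median close to max
    if (med - lo) / hi <= tol:
        return 4  # mostly high, occasionally low: median close to min
    return 3      # highly variable: median far from both
```

For example, five runs of `[100, 98, 97, 99, 40]` have a median near the maximum and a single outlier, so they would fall into category 2 under this tolerance.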

For robust detection of differentiation, we discard all
tests where a majority of flows are affected by high noise
(i.e., categories 3 and 4). For these tests, we cannot determine
whether the difference in throughput is caused
by differentiation or transient noise. We analyze only
the remaining tests, for which a majority of runs experience
low noise (i.e., categories 1 and 2).

To help determine which tests belong to the predom- 
inantly low noise category, we plot the difference be- 
tween maximum and median throughput as a percentage 
of maximum throughput in Figure 5. We found that for 
a large majority of tests (85.2% of upstream tests and 
75.7% of downstream tests) the median throughput is 
within 20% of the maximum. The difference between 
median and maximum throughput is considerably larger 
for the remaining flows. We thus use the 20% difference 
between median and max throughputs as a threshold 
to discard tests that are significantly affected by noise. 
Next, we describe how we detect traffic differentiation 
within the remaining tests. 
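
The noise filter can be sketched as follows. The noise measure (difference between maximum and median throughput as a fraction of the maximum) and the 20% threshold come from the text; the function names, the dictionary layout, and the choice to require every flow type to pass are assumptions for illustration.

```python
from statistics import median

NOISE_THRESHOLD = 0.20  # 20% threshold from the text

def noise_level(throughputs):
    """(max - median) / max for the repeated runs of one flow type."""
    peak = max(throughputs)
    if peak <= 0:
        raise ValueError("need at least one positive throughput")
    return (peak - median(throughputs)) / peak

def is_low_noise(runs_by_flow_type, threshold=NOISE_THRESHOLD):
    """Keep a test only if every flow type stays within the threshold.

    Requiring all flow types to pass is an assumption of this sketch.
    """
    return all(noise_level(runs) <= threshold
               for runs in runs_by_flow_type.values())
```

A test whose reference runs were `[100, 50, 55, 52, 51]` would be discarded, since the median (52) is 48% below the maximum.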


5.2 Detecting differentiation in low-noise 
tests 


To detect traffic differentiation among tests that are identified
as low noise, we compare the maximum throughput
of each flow type. Our decision to use the maximum 
is based on the observations that (a) in low-noise cases, 
most measurements lie close to the maximum through- 
put and (b) because noise tends to lower throughput, the 
maximum throughput is a good approximation for what 
the flows would achieve without cross-traffic. 

We infer that the two flow types are being treated differently
if the maximum throughput of one differs from
that of the other by more than a threshold δ. Selecting a
good δ involves a trade-off. With high values, we cannot
detect differentiation unless the impact on throughput
is high. For instance, with δ=50%, we would only
detect differentiation that halves the flow throughput.
Thus, high values raise the false negative rate. On the
other hand, with low values of δ (say 5%), we risk false
positives, i.e., declaring that ISPs are employing traffic
differentiation while they actually do not.


Figure 6: Selecting a good throughput difference
threshold. Thresholds smaller than 20% tend to produce
a significant number of false positives.

To understand how the false positive rate varies with
δ, we selected 302 test runs from users from ISPs that
we know do not differentiate. Figure 6 plots the percentage
of tests that are falsely marked as being differentiated
for different threshold values. The plot shows
an interesting trend: the false positive rate drops steeply
until δ reaches 20%. Beyond this threshold, there are a
handful of hosts (0.58%) that pass our noise tests but are
still falsely marked as differentiated. To avoid any false
positives, we would need to raise the threshold to 40%,
which increases the false negative rate.

We thus set δ to 20%. With this value we maintain a
low false positive rate (under 0.6%), but we fail to detect 
differentiation that reduces a flow’s throughput by less 
than 20%. We consider this an acceptable trade-off. 
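
The decision rule of this section can be sketched in a few lines. Comparing the maximum throughput of each flow type against a threshold of 20% follows the text; normalizing the difference by the larger of the two maxima is an assumption of this sketch (it is consistent with the remark that a 50% threshold corresponds to halving the throughput), and the function names are hypothetical.

```python
DELTA = 0.20  # threshold chosen in the text for long tests

def throughput_difference(runs_a, runs_b):
    """Relative gap between the maximum throughputs of two flow types."""
    max_a, max_b = max(runs_a), max(runs_b)
    larger = max(max_a, max_b)
    if larger <= 0:
        raise ValueError("need at least one positive throughput")
    return abs(max_a - max_b) / larger

def is_differentiated(runs_app, runs_ref, delta=DELTA):
    """Flag differentiation if the maxima differ by more than delta."""
    return throughput_difference(runs_app, runs_ref) > delta
```

With application runs peaking at 450 and reference runs peaking at 1000, the relative gap is 0.55 and the test would be flagged; a gap of 0.10 would not be.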


5.3 User impatience with long tests


As described above, we configured Glasnost to run a 
pair of one-minute-long flows five times, resulting in a 
total test time of 10 minutes. The tests we originally
deployed also detected whether the differentiation was 
based on port number or payload, extending the test du- 
ration to 20 minutes. While this test configuration en- 
ables us to detect differentiation with high confidence, 
we noticed that a considerable fraction of users were 
aborting the tests before completion. 

Figure 7 shows how long users keep their Glasnost
test running. The plot for 20 minute long tests shows an
alarming decline in the percentage of users as the test
progresses. Only 40% of the users stay till the end and
nearly 50% aborted their tests within the first 10 minutes.
The sudden drop near the 20 minute point corresponds
to successfully completed tests.


Figure 7: Duration users run the Glasnost test.
Longer duration tests are aborted by a larger fraction of
users.

Our results show that users are impatient. Most are 
not willing to use tests that take more than a few min- 
utes. To confirm this, we reconfigured Glasnost to use 
shorter-duration tests. We reduced the number of times 
we repeat each flow type to two (from five), and we de- 
creased the duration of each flow to 20 seconds (from 60 
seconds), which is still sufficient for TCP to exit slow- 
start and achieve stable throughput. We bundled the tests 
for both upstream and downstream directions, and the 
resulting test takes 5.33 minutes, or roughly 6 minutes.

Figure 7 also shows how long users keep the 6 minute
Glasnost test running. More than 80% of the users stay 
till the end, confirming that shorter tests on the order of 
a few minutes are more effective at retaining users. 
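
The test durations quoted above can be reproduced with simple arithmetic. The decomposition below (two flow types per pair, port-based and payload-based triggers, one or two directions) is our reading of the text rather than a stated formula, and `total_duration_seconds` is a hypothetical helper.

```python
def total_duration_seconds(flow_seconds, runs_per_flow_type,
                           triggers, directions, flow_types=2):
    """Total time spent sending flows, ignoring setup and upload time."""
    return (flow_seconds * runs_per_flow_type * flow_types
            * triggers * directions)

# Long test: 60 s flows, five runs per flow type, port and payload triggers.
long_test = total_duration_seconds(60, 5, triggers=2, directions=1)

# Short test: 20 s flows, two runs per flow type, both triggers,
# upstream and downstream bundled together.
short_test = total_duration_seconds(20, 2, triggers=2, directions=2)

print(long_test / 60, round(short_test / 60, 2))
```

Under these assumptions the long test sends flows for 1,200 seconds (20 minutes) and the short test for 320 seconds (about 5.33 minutes), matching the figures in the text.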


5.4 Detecting differentiation with short 
tests 


Short duration tests are challenging for detecting differ- 
entiation robustly because they gather few measurement 
samples. To estimate the impact of this reduction in data 
on detection accuracy, we consider data from the longer 
tests for which we have a result, i.e., for which we know 
whether or not the ISP is differentiating. We prune the 
data to include only what would be gathered by the short 
test and run our analysis on the pruned data. We com- 
pare the results from this shorter test data with those ob- 
tained before. 

We find that nearly 25% of the long tests that we were
able to successfully analyze before were discarded as
too noisy after pruning. We find the false positive rate
(i.e., cases when the long test found no traffic differentiation
but the short test did) to be 2.8% and the false
negative rate (i.e., cases when the long test found traffic
differentiation but the short test did not) to be 0.9%.

We also find that we can achieve a four-fold reduction
in false positive rate, to 0.7% (which is comparable
to the false positive rate of long tests), by raising the δ
threshold from 20% to 50%. While this increases the
false negative rate to 1.7%, we consider it an acceptable 
trade-off. 
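
The comparison between long-test and short-test verdicts can be sketched as below. The text does not state which denominator is used for the rates; this sketch assumes they are computed over the tests that the short analysis did not discard, and the function name and data layout are hypothetical. The sample data is invented for illustration and is not from the Glasnost dataset.

```python
def compare_short_to_long(pairs):
    """Summarize how short-test verdicts deviate from long-test verdicts.

    Each pair is (long_verdict, short_verdict). long_verdict is True if
    the long test found differentiation. short_verdict is True/False, or
    None if the pruned data was discarded as too noisy.
    Returns (discarded_fraction, false_positive_rate, false_negative_rate).
    """
    kept = [(lv, sv) for lv, sv in pairs if sv is not None]
    if not kept:
        raise ValueError("all tests were discarded")
    discarded = (len(pairs) - len(kept)) / len(pairs)
    false_pos = sum(1 for lv, sv in kept if not lv and sv) / len(kept)
    false_neg = sum(1 for lv, sv in kept if lv and not sv) / len(kept)
    return discarded, false_pos, false_neg

# Invented example verdicts, for illustration only.
sample = [(True, True), (False, False), (False, True), (True, False),
          (False, None), (False, False), (True, True), (False, False)]
discarded, false_pos, false_neg = compare_short_to_long(sample)
```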


6 Facilitating New Test Construction 


Manually implementing Glasnost tests for a new appli- 
cation is a laborious and error prone task. It requires 
detailed knowledge of the application’s protocols and 
their common implementations. This creates a high bar- 
rier for new test construction, making it difficult to keep 
pace with the evolution of ISPs’ policies. 

In this section, we present a tool called 
trace-emulate that simplifies the construction 
of new tests by automating most of the process. We 
also present a validation of the tests constructed by 
trace-emulate using the open source DPI engine 
of a commercial traffic shaper [22]. 

Our trace-emulate tool automatically generates 
a new Glasnost test from the packet-level trace of an 
application. It extracts the essential characteristics of 
the application flows. These include packet sizes and 
payloads as well as the order of packets with protocol 
messages and the inter-packet timing. 

The test configuration that trace-emulate outputs
is then used by the Glasnost Java applet to run the
test. When run against the server, the applet exchanges 
two flows. The first flow has the same characteristics 
as the original trace. For example, assume that in the 
original trace the client performed the following opera- 
tions: (1) sent packet A, (2) received packet B, (3) sent 
packet C after t seconds. These operations occur in the
same order and at the same relative times in the generated flow. In
some cases, simultaneously preserving packet ordering 
and inter-packet timing is impossible. Such cases arise 
when an endpoint is waiting to receive a packet that gets 
delayed in the network. We make the endpoint (client 
or the server) wait until the packet is received before 
continuing the emulation, even though it increases the 
inter-packet time. Our decision to preserve ordering at 
the expense of timing is motivated by the observation 
that ISPs often use the sequence of protocol messages 
to identify applications, rather than their relative timing. 
The second flow exchanged by the applet is a reference
flow with the same characteristics but uses different pay- 
loads and ports. The user uploading the trace can set 
the ports to specific values, e.g., the application’s default 
port; otherwise, random ports are used. 
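
The choice to preserve ordering at the expense of timing can be illustrated with a small simulation. This is a sketch under stated assumptions, not the trace-emulate code: the trace format `(action, packet, gap)`, the `arrival_delay` map, and the function name `replay_timeline` are all hypothetical.

```python
def replay_timeline(trace, arrival_delay):
    """Compute when each operation happens during replay.

    trace: list of (action, packet, gap), where gap is the time in
        seconds since the previous operation in the original trace.
    arrival_delay: extra seconds by which a received packet is late
        in the network during replay (missing packets are on time).
    Returns a list of (time, action, packet) in the original order.
    """
    now = 0.0
    timeline = []
    for action, packet, gap in trace:
        now += gap
        if action == "recv":
            # Block until the packet actually arrives; later operations
            # shift in time but keep their order.
            now += arrival_delay.get(packet, 0.0)
        timeline.append((now, action, packet))
    return timeline

# Original trace: send A, receive B 0.1 s later, send C 0.5 s after that.
trace = [("send", "A", 0.0), ("recv", "B", 0.1), ("send", "C", 0.5)]
timeline = replay_timeline(trace, {"B": 0.3})
```

If B arrives 0.3 seconds late, C is still sent only after B is received, so the message sequence A, B, C is intact while the absolute timing stretches.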

Our experiments confirm that the replaying method 
of trace-emulate produces the same packet sizes, 
payloads, and ordering as the original trace. We omit 
detailed results. 


Validating tests generated by trace-emulate. We validate
that trace-emulate captures the essential characteristics
that an ISP might use to identify an application
flow in practice. While ISPs can, in theory, use arbitrarily
complex mechanisms, in practice they are limited
to using mechanisms that can scale to at least multiple
Gbps. We are therefore interested in validating
trace-emulate against practical detection mecha- 
nisms used by ISPs. 

As one might imagine, ISPs use traffic classification 
solutions from third-party vendors such as Sandvine, 
BlueCoat, and Arbor Networks; most ISPs do not build 
their own system. Fortunately, pressure from privacy 
watchdogs compelled one of these vendors — Ipoque — 
to release the code it uses to inspect user traffic [22]. 
This release gives the research community, for the first 
time, access to production code that ISPs use to detect 
the application that a user is running. 

The Ipoque code allows us to realistically validate 
trace-emulate. By inspecting the code we dis- 
cover what applications are detected. We run the ap- 
plication and check whether the Ipoque detector de- 
tects the application from the packet flow. We then use 
trace-emulate to generate a Glasnost test for that 
application. We run the test and check whether the re- 
sult is the same. If Ipoque detects our emulated flow 
as the target application, then we have successfully cap- 
tured the essential characteristics that are necessary for 
detection by a commercial traffic classifier. 

Ipoque’s detector can identify traffic from more than 
90 widely-used applications broadly classified as peer- 
to-peer, video streaming, instant messaging, online 
gaming, and other applications (email, web, etc.). It 
took us less than two hours to generate Glasnost tests 
for 10 representative applications in all five of the above 
categories. This included eMule, Gnutella, and BitTor- 
rent (all P2P); YouTube (streaming video); World of 
Warcraft (online game); IRC (instant messaging); and 
HTTP, FTP and IMAP. For eMule and Gnutella, Ipoque 
separately identifies their control and data connections; 
consequently, we used trace-emulate to generate 
the corresponding two tests. That we were able to gen- 
erate all tests in a matter of hours is a testament to the 
simplicity of trace-emulate. 

In every single case, Ipoque identified the test gen- 
erated by trace-emulate as the target application. 
To the extent Ipoque is representative of other similar 
vendors, we can claim that trace-emulate captures 
the essential flow characteristics for applications that do 
not encrypt traffic. However, without knowledge of how 
Ipoque detects applications from encrypted traffic, we 
cannot make any claims in that regard. 

To convince ourselves that the Ipoque result holds in
the real world, we further validated trace-emulate
against Kabel Deutschland, the biggest cable ISP in Germany.
Kabel Deutschland targets P2P filesharing applications
between 6pm and midnight [12]; their chosen
vendor for traffic shaping equipment is unknown.
In any event, since we know their policy, validating
trace-emulate is straightforward. We run tests
we generated for BitTorrent, eMule, Gnutella, HTTP,
IMAP, and SSH from a Kabel Deutschland user, and
check if Glasnost detects traffic differentiation.


Application         Port-based        Content-based
BitTorrent          6881, down        down
eMule data          4662, down        down
Gnutella control    6346, down+up     down+up
Gnutella data       6346, down+up     down
HTTP                no                no
IMAP                no                no
SSH                 no                no


Table 1: Results from running new Glasnost tests on
a host connected via Kabel Deutschland. We identified
instances of port-based and content-based traffic
differentiation in both the downstream (down) and
upstream (up) directions.

Glasnost detected traffic differentiation for each of 
the P2P applications, and none of the non-P2P appli- 
cations. In fact, by running the tests in both directions 
(downstream and upstream) and using different ports 
(default application port, random port), we were able to 
refine the policy published by Kabel Deutschland. Table 1
shows that P2P traffic is differentiated regardless
of the port number used (i.e., based on the packet content).
Next, we ran the HTTP, IMAP, and SSH tests on
the ports typically used by the three P2P applications
and found the flows achieved significantly lower throughput.
Running the same tests on random ports resulted in
normal throughput. This is precisely what one might ex- 
pect if Kabel Deutschland additionally uses port-based 
detection, which naturally has false-positives. Regard- 
less of whether Kabel Deutschland sought to omit men- 
tion of side effects of their differentiation policy or we 
have identified a misconfiguration, our finding demon- 
strates the value of network transparency tools such as 
Glasnost. 


7 Deployment Experiences 


We deployed Glasnost publicly on the Internet on 
March 18th, 2008 and it has been operational ever 
since. It can be accessed at
http://broadband.mpi-sws.org/transparency/glasnost.php. Initially, Glasnost
was deployed on eight servers at MPI-SWS. Over the 
last year, the number of servers has grown to eighteen 
with the use of Measurement Lab (M-Lab) [16], an open 
platform for the deployment of Internet measurement
tools to enhance network transparency. Eleven servers 
are in Europe, three on the west coast of the USA, and 
four on the east coast of the USA. 

In the beginning, we chose to focus on one application 
as we developed our system and refined its techniques. 
We picked BitTorrent because it is widely suspected of 
being manipulated by ISPs [15]. However, our differ- 
entiation detection techniques are not specific to BitTor- 
rent and can be applied to other applications as well. 
Because we have only recently deployed tests for other 
applications, an overwhelming majority of our data is 
from BitTorrent. We thus limit most of the discussion 
below to BitTorrent. 


Details of deployed tests. In this paper, we present 
results for four BitTorrent tests deployed on Glasnost. 
These tests detect port- and content-based differentia- 
tion in the upstream as well as the downstream direction. 
Each test involves emulating BitTorrent and reference 
flows. For detecting content-based differentiation, we 
replace BitTorrent packet payloads with random bytes 
in the reference flows, while keeping other aspects iden- 
tical. For detecting port-based differentiation, only the 
port of the reference flow is switched from a well-known 
BitTorrent port (e.g., 6881) to a neutral port that is not 
associated with any particular application (e.g., 10009). 
We emulate flows in both upstream and downstream di- 
rections to check for manipulation of both BitTorrent 
uploads and downloads. As described in Section 5, we 
configured Glasnost to offer a 6-minute long test to users 
with each flow running for 20 seconds. Also, each flow 
type is repeated once. 
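
The four deployed tests can be enumerated with a short sketch. The ports 6881 and 10009 and the rule that only one aspect changes per test come from the text; the dictionary layout, the function name `deployed_tests`, and running the content-based test on the BitTorrent port are assumptions for illustration.

```python
from itertools import product

BITTORRENT_PORT = 6881   # well-known BitTorrent port named in the text
NEUTRAL_PORT = 10009     # neutral port named in the text

def deployed_tests():
    """Enumerate the port/content x upstream/downstream test matrix."""
    tests = []
    for trigger, direction in product(("port", "content"),
                                      ("upstream", "downstream")):
        app_flow = {"port": BITTORRENT_PORT, "payload": "bittorrent"}
        if trigger == "port":
            # Only the port changes; the payload stays BitTorrent.
            ref_flow = {"port": NEUTRAL_PORT, "payload": "bittorrent"}
        else:
            # Only the payload changes; keeping the port is an assumption.
            ref_flow = {"port": BITTORRENT_PORT, "payload": "random"}
        tests.append({"trigger": trigger, "direction": direction,
                      "app_flow": app_flow, "ref_flow": ref_flow})
    return tests

tests = deployed_tests()
```

Each entry differs between its application flow and reference flow in exactly one field, which is what lets a throughput gap be attributed to that field.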


Usage. Between March 18th, 2008 and September 21st, 
2009, 368,815 users* from 5,846 ISPs used Glasnost to
test for traffic differentiation. We believe that our large 
user base is a result of our focus on lowering the barrier 
of use such that even lay users can use our system. 

Figure 8 shows that our users have a wide geo- 
graphical footprint. They come from North Amer- 
ica (38%), Europe (36%), South America (11%), Asia 
(12%), Oceania (3%), and Africa (<1%). 

Table 2 lists the top 20 access ISPs to which our users 
belonged. Users’ IP addresses are mapped to ISPs us- 
ing whois information from the Regional Internet Reg- 
istries. We see that a large fraction of our users are from 
some of the largest residential ISPs in their respective 
countries, such as Comcast in the USA, Bell Canada in 
Canada, or BT in the UK. 


* In this section, we use the terms tests, IP addresses, and users interchangeably.
There are very few IP addresses from which we saw
repeat tests and a vast majority of tests correspond to a unique IP
address. The same end user may be associated with different IP addresses
during the course of our study. By overlooking this, we may
be over-counting the number of unique end users.
Figure 8: Location of Glasnost users.





ISP                    Tests     ISP                   Tests
Comcast (US)           29,464    BT (UK)               5,192
RoadRunner (US)        16,257    Chunghwa T. (TW)      5,084
AT&T (US)              10,884    Shaw (CA)             4,933
UPC (NL)                8,871    Brasil Telec. (BR)    4,862
Verizon (US)            7,611    Rogers (CA)           4,499
Cox (US)                4,194    Telefonica (BR)       4,408
Net Virtua (BR)         7,207    Telefonica (ES)       4,229
Telecom Italia (IT)     6,955    NTL (UK)              3,852
Charter (US)            3,634    Vivo (BR)             3,723
Bell Canada (CA)        5,233    GVT (BR)              3,123


Table 2: Top 20 ISPs based on the number of Glas- 
nost tests conducted by their users. 


7.1 Characterizing BitTorrent Differenti- 
ation 


We now use the data collected during our deployment 
to characterize BitTorrent differentiation in the Internet. 
To our knowledge, such detailed characterization was 
not available before. 

Figure 9 shows the percentage of users for whom 
we detected differentiation in at least one of the four 
tests that we widely deployed on Glasnost. Aside from 
a few weeks in the beginning when we did not have 
enough users, this percentage has stayed roughly con- 
stant around 10%. Thus, a non-negligible fraction of our 
testers are subject to differentiation. 

We do not, however, claim that 10% of all Internet 
users experience differentiation. Glasnost users are self- 
selecting, and our data may be biased towards users that 
suspect their ISP to be differentiating against BitTorrent. 


7.2 Understanding ISP behaviors 


Our Glasnost deployment was so popular that we had 
hundreds of users from some of the largest ISPs world- 
wide. Aggregating results from all the users belonging 
to an ISP can provide an understanding of the extent 
to which the ISP differentiates traffic. Such ISP-wide 
perspectives are especially useful for policy makers and 
government regulators responsible for monitoring ISP 
behavior. Further, end users can compare the state of 
Figure 9: Percentage of tests in which we detected
differentiation since March 2008. 


differentiation across different ISPs to make a more in- 
formed choice when selecting their ISP. 

We now turn our attention to understanding the poli- 
cies of individual access ISPs. For this analysis, we 
map users to their access ISPs (using whois) and assume 
that the access ISP is responsible for any observed dif- 
ferentiation. While it is possible that the responsibility 
lies with a transit ISP along the path, differentiation is a 
more common practice amongst access ISPs [1, 5]. 

We limit the analysis in this section to the tests con- 
ducted in the two-month period that covers January and 
February 2009 because the differentiation behavior of 
an ISP can change over time. We select the two-month 
period for which we have the most data. Further, we 
consider only ISPs for which we have at least 100 tests 
in this time period. There are 140 such ISPs. 
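
The selection and per-ISP aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used in the study: the record layout `(isp, day, differentiated)` and the name `per_isp_fraction` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def per_isp_fraction(tests, start: date, end: date, min_tests: int = 100):
    """Fraction of tests per ISP that detected differentiation.

    tests: iterable of (isp, day, differentiated) tuples (assumed layout).
    Only tests with start <= day <= end are counted, and only ISPs with
    at least min_tests tests in that window are reported.
    """
    counts = defaultdict(lambda: [0, 0])  # isp -> [total, positive]
    for isp, day, differentiated in tests:
        if start <= day <= end:
            counts[isp][0] += 1
            counts[isp][1] += int(differentiated)
    return {
        isp: positive / total
        for isp, (total, positive) in counts.items()
        if total >= min_tests
    }
```

Calling `per_isp_fraction(tests, date(2009, 1, 1), date(2009, 2, 28))` would restrict the data to the two-month window and drop ISPs with fewer than 100 tests.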


7.2.1 Basis for differentiation 


Table 3 shows the list of the top-30 ISPs ranked based 
on the fraction of hosts that detected differentiation. Ta- 
ble 3 also shows how traffic is differentiated. More than 
half the ISPs differentiate only in the upstream direction 
and 7 ISPs only in the downstream direction. 20% of 
ISPs (e.g., Clearwire, TVCABO) differentiate in both 
directions. We also find that most differentiating ISPs 
use both content- and port-based differentiation. For 
only four ISPs (Free, GVT, Pipex, and Tiscali UK) do 
we observe an exclusive use of port-based differentia- 
tion (which is easier to evade). And only one ISP, Oi, 
uses content-based differentiation exclusively. 

Our results show that Glasnost can shed light on how 
ISPs identify the traffic they differentiate. 


7.2.2 Fraction of users impacted 


ISPs that differentiate against BitTorrent traffic do not 
do so for every user. For each ISP in Table 3, Figure 10 
shows the fraction of users that tested positive for dif- 
ferentiation. We see that in the median case only 21% 
of users are affected. Given our tests’ low false posi- 
ISP Loc. | Upstream: app port | Downstream: app port

Bell Canada (D) CA | x x
Brasil Telecom (D) BR x x
BT (D) UK | x x
Cablecom (C) CH | x x
Canaca (D) CA | x x
City Telecom (F) HK | x x x x
Clearwire (W) US | x x x x
Cogeco (C) CA | x x
EastLink (C) CA | x x
Free (D) FR x
GVT (D,F) BR x
Kabel Deutschland (C) DE | x x _ x
Magix (D) SG x x
Oi (D) BR x
ONO (C) ES x x
PCCW (D) HK | x x x x
Pipex (D) UK x
Rogers (C) CA | x x
Shaw (C) CA | x x
TekSavvy (D) CA | x x
Tele2 (D) IT x x
Telenet (D) BE x x
TEN (D) TW x x
Tiscali Italia (D) IT x x
Tiscali UK (D) UK x
TM Net (D) MY | x x
TVCABO (C) PT | x x x x
UPC NL (C) NL | x x
UPC Poland (C) PL | x x
UPC Romania (C) RO | x x

Table 3: Top 30 ISPs based on the fraction of users that are affected by traffic differentiation during January
and February 2009. The table shows if the flows are differentiated based on application content (app), TCP ports, or
both. The letter in parentheses gives the type of access network the ISP runs, i.e., DSL (D), Cable (C),
Fiber-To-The-Home (F), and WiMax (W).
Figure 10: Typically, we detected traffic differentiation
for only a fraction of an ISP’s users. (Axes: percentage of users
affected by differentiation per ISP; fraction of differentiating ISPs.)


tive and negative rates, this inconsistent impact within 
an ISP cannot be explained by inference errors alone. 

Our data does not allow us to infer why only a frac- 
tion of users of an ISP experience traffic differentiation. 
There are many possible reasons. An ISP might choose 
to target only customers who generate a lot of P2P traf- 
fic, the traffic shapers might be deployed in only a por- 
tion of the ISP network, or an ISP might differentiate 
only during peak hours or periods of high load. 


7.2.3 Dependence on time of day 


One potential explanation for why only some users ex- 
perience differentiation is that ISPs may differentiate 
only during peak hours, when the network is experienc- 
ing the greatest load. To investigate the dependence on 
time of day we divided our dataset into two time periods 
based on the local time of the user³. The peak period


³We used an IP-to-geolocation tool to infer the timezone of each user.

is 8pm to 12am, and the off-peak period is 5am to 9am. These
periods are strict subsets of the peak and off-peak dura- 
tions for access ISPs [5, 14]. 

For each period we infer if an ISP differentiated traffic.
Our analysis excludes ISPs that have fewer than 100 measurements
for either of the two time periods. This leaves
us with 30 ISPs. We find that slightly more than half
of these ISPs differentiate during both peak and off-peak
hours. The other ISPs, e.g., BT, Bell Canada, Kabel
Deutschland, ONO, and Tiscali UK, restrict traffic
differentiation to the peak period.
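
A minimal sketch of the time-of-day split, assuming each measurement carries a UTC timestamp and a UTC offset (in hours) inferred from geolocation. The names `classify_period` and `isps_with_enough_data` and the record layout are hypothetical and serve only to illustrate the procedure.

```python
from collections import Counter
from datetime import datetime, timedelta

PEAK_HOURS = range(20, 24)     # 8pm to 12am local time
OFF_PEAK_HOURS = range(5, 9)   # 5am to 9am local time


def classify_period(utc_time: datetime, utc_offset_hours: float):
    """Return "peak", "off-peak", or None based on the user's local hour."""
    local = utc_time + timedelta(hours=utc_offset_hours)
    if local.hour in PEAK_HOURS:
        return "peak"
    if local.hour in OFF_PEAK_HOURS:
        return "off-peak"
    return None  # measurement falls outside both periods


def isps_with_enough_data(measurements, min_count: int = 100):
    """ISPs with at least min_count measurements in both periods.

    measurements: iterable of (isp, utc_time, utc_offset_hours) tuples.
    """
    counts = Counter()
    for isp, utc_time, utc_offset_hours in measurements:
        period = classify_period(utc_time, utc_offset_hours)
        if period is not None:
            counts[(isp, period)] += 1
    isps = {isp for isp, _ in counts}
    return {
        isp for isp in isps
        if counts[(isp, "peak")] >= min_count
        and counts[(isp, "off-peak")] >= min_count
    }
```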

Our results in the last two sections show the impor- 
tance of enabling end users to detect differentiation for 
themselves and at particular points in time. Many existing
tools attempt to discover whether or not an ISP differentiates
traffic [27, 30]. Since not all users of an ISP
are affected by differentiation all the time, ISP-wide in- 
formation alone is not sufficient for a user to determine 
if she experiences differentiation. 


7.3 User feedback 


Since our system became operational, we have received 
more than one hundred e-mails from users. The feed- 
back is overwhelmingly positive, and it reveals two 
pieces of information. First, we find evidence of false 
negatives in our results. Around 6% of our emailers 
were skeptical when Glasnost did not discover traffic 
differentiation. They were convinced that their ISP dif- 
ferentiates, sometimes based on information their ISP 
publishes. If these users are right, their cases confirm
that our decision to minimize the false positive rate
comes at the cost of false negatives. While we continue 
to investigate ways to reduce the false negative rate, we 
are pleased to report that no user has complained about 
the presence of a false positive. 

Second, some emails requested Glasnost tests for 
other P2P applications such as eMule as well as non-P2P 
applications such as FTP, SSH, and HTTP. The constant 
stream of such requests motivated us to open the Glas- 
nost platform and allow users to contribute new Glas- 
nost tests. We describe this extension in the following 
section. 


7.4 User-contributed Glasnost tests


It is not feasible for us to create Glasnost tests for each 
of the large number of applications and possible traf- 
fic differentiation policies that are of interest to users. 
Hence, we decided to allow users to create their own
Glasnost tests using the trace-emulate tool that we
described earlier. To create a new test, users need to
capture a packet trace of their target application using 
tcpdump and then use trace-emulate to create a 
new Glasnost test from the trace. These new tests can be 
uploaded to our measurement servers using the Glasnost 
webpage. Our interface for creating new tests is targeted 
not at lay users, but at advanced users who have some 
familiarity with capturing network traces. 

We have deployed this interface only recently, and we 
do not yet have a lot of experience with it. However, 
we asked a handful of our colleagues, who are doctoral 
students not associated with our project, to use the in- 
terface to create new Glasnost tests: they were able to 
create new tests quite easily. 


8 Related Work


This section describes Glasnost in the context of exist- 
ing work on traffic differentiation, trace replay, and mea- 
surement systems. 


Traffic Differentiation. Three early studies investi- 
gated the prevalence of blocking for BitTorrent [10, 11] 
or for general traffic based on port numbers [4]. They 
found blocking to be relatively common. Our results 
show that gentler forms of differentiation are now much 
more prevalent than outright blocking. 

Three recent efforts proposed techniques for detecting 
traffic differentiation. NetPolice [31] (previously named 
NVLens [30]) compares the aggregate loss rates of dif- 
ferent flows to infer the presence of “network neutrality 
violations” in backbone ISPs. In contrast, Glasnost fo- 
cuses on enabling individuals to detect whether they are 
subject to traffic differentiation. 


NANO [27] uses causal inference to infer the pres- 
ence of traffic performance degradation. NANO re- 
lies on a vast amount of passively collected traces from 
many users to infer if traversing a particular ISP leads to 
poorer performance for certain kinds of traffic. In con- 
trast, Glasnost uses active measurements and a simple 
head-to-head comparison of two flows to quickly inform 
users whether they face traffic differentiation — without 
relying on other users. However, adding passive mea- 
surement techniques to Glasnost might enable it to de- 
tect time- or usage-dependent traffic differentiation. 

DiffProbe [13] detects whether traffic differentiation 
based on active queue management (AQM), such as 
RED and weighted fair queueing, is deployed in the 
network path. DiffProbe complements Glasnost as it 
can detect differentiation that leads to a small increase in
latency and can identify the AQM technique used. If 
AQM affects application throughput, Glasnost can also 
detect this. 

Trace replay. Monkey [6] is a TCP replay tool that 
takes a packet-level trace as input and generates a new 
trace with similar network-level properties, such as la- 
tency and bandwidth. More recent work [8] investi- 
gates ways to infer higher-level protocols from low-level 
packet traces. Our trace-emulate tool is an adap- 
tation of such methods. 

Measurement systems. Many researchers use net- 
work testbeds, such as PlanetLab [24], RON [3], and 
NIMI [23], to conduct measurement studies. Unlike 
Glasnost, these testbeds are designed explicitly for use 
by researchers. There are a number of tools deployed on 
M-Lab [16] with the goal of enhancing Internet trans- 
parency. Most of them are generic measurement tools 
that characterize certain features of the Internet.

The DIMES project [9] is based on the SETI@home 
model. It uses volunteer-contributed hosts to run 
traceroute measurements that are used to map 
the connectivity of edge networks. The two systems, 
DIMES and Glasnost, offer an interesting (if unfair due 
to different goals) comparison of user models. DIMES 
relies on the ability to run arbitrary code on users’ com- 
puters. It was deployed over four years ago and has 
about 8,000 users. 

Finally, Netalyzr [17] is a web-based measurement
tool that mostly focuses on the detection of network- 
ing problems. Like Glasnost, it targets lay users with an 
easy-to-use interface and allows them to detect, for instance,
manipulation of web content by an HTTP proxy in
the path or blocking of traffic on some prominent ports. 


9 Conclusion 


We described Glasnost, a system that we deployed more 
than a year ago to let ordinary users detect traffic differentiation
along their paths. More than 350,000 users
from over 5,800 ISPs worldwide have used it to detect 
BitTorrent differentiation. We believe that our focus on 
making it easy for lay users to use the system and to 
understand its results has led to its success. Using the
data gathered by Glasnost, we also presented what to our 
knowledge is the first detailed analysis of BitTorrent dif- 
ferentiation practices in the Internet. The data collected 
by Glasnost is available through M-Lab [16]. 


Over the past year, we have encountered many re- 
searchers who were skeptical about the benefits of mea- 
suring traffic differentiation. Even some of this paper’s 
authors were initially skeptical. A common argument is 
that, since traffic differentiation is attracting so much at- 
tention from industry and the government, the permissi- 
ble practices would soon be standardized and apparent. 
The skeptics might or might not be right. But the popularity
of Glasnost and the positive feedback show that
many users are curious about the behavior of their Inter- 
net paths. Indeed, Glasnost’s impact goes beyond traffic 
differentiation in our view. Its design shows one effec- 
tive way to build and deploy a measurement system that 
satisfies such curiosities and makes the network more 
transparent to its users. 
Abstract


In many enterprises today, WAN optimizers are be- 
ing deployed in order to eliminate redundancy in net- 
work traffic and reduce WAN access costs. In this pa- 
per, we present the design and implementation of En- 
dRE, an alternate approach where redundancy elimina- 
tion (RE) is provided as an end system service. Unlike 
middleboxes, such an approach benefits both end-to-end 
encrypted traffic as well as traffic on last-hop wireless 
links to mobile devices. 

EndRE needs to be fast, adaptive and parsimonious in 
memory usage in order to opportunistically leverage re- 
sources on end hosts. Thus, we design a new fingerprint- 
ing scheme called SampleByte that is much faster than 
Rabin fingerprinting while delivering similar compres- 
sion gains. Unlike Rabin fingerprinting, SampleByte can 
also adapt its CPU usage depending on server load. Fur- 
ther, we introduce optimizations to reduce server mem- 
ory footprint by 33-75% compared to prior approaches. 
Using several terabytes of network traffic traces from 
11 enterprise sites, testbed experiments and a pilot de- 
ployment, we show that EndRE delivers 26% bandwidth 
savings on average, processes payloads at speeds of 1.5- 
4Gbps, reduces end-to-end latencies by up to 30%, and 
translates bandwidth savings into equivalent energy sav- 
ings on mobile smartphones. 


1 Introduction 


With the advent of globalization, networked services 
have a global audience, both in the consumer and en- 
terprise spaces. For example, a large corporation today 
may have branch offices at dozens of cities around the 
globe. In such a setting, the corporation’s IT admins 
and network planners face a dilemma. On the one hand, 
they could concentrate IT servers at a small number of 
locations. This might lower administration costs, but in- 
crease network costs and latency due to the resultant in- 
crease in WAN traffic. On the other hand, servers could 
be located closer to clients; however, this would increase 
operational costs. 


'A part of this work was done while the authors were interns at 
Microsoft Research India. 

*The author was a visiting researcher at Microsoft Research India 
during the course of this work. 


This paper arises from the quest to have the best of 
both worlds, specifically, having the operational benefits 
of centralization along with the performance benefits of 
distribution. In recent years, protocol-independent re- 
dundancy elimination, or simply RE [20], has helped 
bridge the gap by making WAN communication more 
efficient through elimination of redundancy in traffic. 
Such compression is typically applied at the IP or TCP 
layers, for instance, using a pair of middleboxes placed 
at either end of a WAN link connecting a corporation’s 
data center and a branch office. Each box caches pay- 
loads from flows that traverse the link, irrespective of the 
application or protocol. When one box detects chunks of 
data that match entries in its cache (by computing “fingerprints”
of incoming data and matching them against
cached data), it encodes matches using tokens. The box 
at the far end reconstructs original data using its own 
cache and the tokens. This approach has seen increasing 
deployment in “WAN optimizers”. 

Unfortunately, such middlebox-based solutions face 
two key drawbacks that impact their long-term use- 
fulness: (1) Middleboxes do not cope well with end- 
to-end encrypted traffic and many leave such data un- 
compressed (e.g., [1]). Some middleboxes accommo- 
date SSL/SSH traffic with techniques such as connection 
termination and sharing of encryption keys (e.g., [5]), 
but these weaken end-to-end semantics. (2) In-network 
middleboxes cannot improve performance over last-hop
links to mobile devices.


As end-to-end encryption and mobile devices become 
increasingly prevalent, we believe that RE will be forced 
out of middleboxes and directly into end-host stacks. 
Motivated by this, we explore a new point in the design 
space of RE proposals — an end-system redundancy 
elimination service called EndRE. EndRE could supple- 
ment or supplant middlebox-based techniques while ad- 
dressing their drawbacks. Our paper examines the costs 
and benefits that EndRE implies for clients and servers. 


Effective end-host RE requires looking for small re- 
dundant chunks of the order of 32-64 bytes (because 
most enterprise transfers involve just a few packets 
each [16]). The standard Rabin fingerprinting algo- 
rithms (e.g., [20]) for identifying such fine scale redun- 
dancy are very expensive in terms of memory and pro- 
cessing especially on resource constrained clients such 
as smartphones. Hence, we adopt a novel asymmetric
design that systematically offloads as much of process- 
ing and memory to servers as possible, requiring clients 
to do no more than perform basic FIFO queue man- 
agement of a small amount of memory and do simple 
pointer lookups to decode compressed data. 

While client processing and memory are paramount, 
servers in EndRE need to do other things as well. This 
means that server CPU and memory are also crucial 
bottlenecks in our asymmetric design. For server pro- 
cessing, we propose a new fingerprinting scheme called 
SampleByte that is much faster than Rabin fingerprint- 
ing used in traditional RE approaches while delivering 
similar compression. In fact, SampleByte can be up to 
10X faster, delivering compression speeds of 1.5-4Gbps. 
SampleByte is also tunable in that it has a payload sam- 
pling parameter that can be adjusted to reduce server 
processing if the server is busy, at the cost of reduced 
compression gains. 

For server storage, we devise a suite of highly- 
optimized data structures for managing meta-data and 
cached payloads. For example, our Max-Match vari- 
ant of EndRE (85.2.2) requires 33% lower memory 
compared to [20]. Our Chunk-Match variant (85.2.1) 
cuts down the aggregate memory requirements at the 
server by 4X compared to [20], while sacrificing a small 
amount of redundancy. 

We conduct a thorough evaluation of EndRE. We an- 
alyze several terabytes of traffic traces from 11 different 
enterprise sites and show that EndRE can deliver signifi- 
cant bandwidth savings (26% average savings) on enter- 
prise WAN links. We also show significant latency and 
energy savings from using EndRE. Using a testbed over 
which we replay enterprise HTTP traffic, we show that 
latency savings of up to 30% are possible from using 
EndRE, since it operates above TCP, thereby reducing 
the number of roundtrips needed for data transfer. Simi- 
larly, on mobile smartphones, we show that the low de- 
coding overhead on clients can help translate bandwidth 
savings into significant energy savings compared to no 
compression. We also report results from a small-scale 
deployment of EndRE in our lab. 

The benefits of EndRE come at the cost of memory 
and CPU resources on end systems. We show that a 
median EndRE client needs only 60MB of memory and a
negligible amount of CPU. At the server, since EndRE is
adaptive, it can opportunistically trade off CPU/memory
for compression savings.

In summary, we make the following contributions: 

(1) We present the design of EndRE, an end host 
based redundancy elimination service (§4).

(2) We present new asymmetric RE algorithms and
optimized data structures that limit client processing and
memory requirements, and reduce server memory usage
by 33-75% and processing by 10X compared to [20]
while delivering slightly lower bandwidth savings (§5).

(3) We present an implementation of EndRE as part of 
Windows Server/7/Vista as well as on Windows Mobile 
6 operating systems (§6).

(4) Based on extensive analysis using several ter- 
abytes of network traffic traces from 11 enterprise sites, 
testbed experiments and a small-scale deployment, we 
quantify the benefits and costs of EndRE (§7 - §9).


2 Related Work 


Over the years, enterprise networks have used a variety 
of mechanisms to suppress duplicate data from their net- 
work transfers. We review these mechanisms below. 

Classical approaches: The simplest RE approach is 
to compress objects end-to-end. It is also the least ef- 
fective because it does not exploit redundancy due to 
repeated accesses of similar content. Object caches 
can help in this regard, but they are unable to extract 
cross-object redundancies [8]. Also object caches are 
application-specific in nature; e.g., Web caches cannot 
identify duplication in other protocols. Furthermore, an 
increasing amount of data is dynamically generated and 
hence not cacheable. For example, our analysis of enter- 
prise traces shows that a majority of Web objects are not 
cacheable, and deploying an HTTP proxy would only 
yield 5% net bandwidth savings. Delta encoding can 
eliminate redundancy of one Web object with respect 
to another [14, 12]. However, like Web caches, delta 
encoding is application-specific and ineffective for dy- 
namic content. 

Content-based naming: The basic idea underlying 
EndRE is that of content-based naming [15, 20], where 
an object is divided into chunks and indexed by com- 
puting hashes over chunks. Rabin Fingerprinting [18] is 
typically used to identify chunk boundaries. In file systems
such as LBFS [15] and Shark [9], content-based
naming is used to identify similarities across different 
files and across versions of the same file. Only unique 
chunks are transmitted between file servers and clients, 
resulting in lower bandwidth consumption. A similar 
idea is used in value-based Web caching [19], albeit be- 
tween a Web server and its client. Our chunk-based En- 
dRE design is patterned after this approach, with key 
modifications for efficiency (§5).
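
The idea of content-based naming can be illustrated with a short Python sketch. It is not the scheme used by LBFS or EndRE: a simple polynomial rolling hash stands in for Rabin fingerprinting, SHA-1 is assumed as the chunk name, and the names `split_chunks` and `chunks_to_send` as well as the window and mask values are hypothetical.

```python
import hashlib

BASE = 257
MOD = (1 << 31) - 1


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Split data at content-defined boundaries.

    A boundary is declared after byte i when the rolling hash of the
    last `window` bytes has its low bits (selected by mask) all zero,
    so boundaries depend on content rather than on absolute offsets.
    """
    top = pow(BASE, window - 1, MOD)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the byte that slides out of the window.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunks_to_send(data: bytes, known_digests: set):
    """Return (digest, chunk) pairs; chunk is None if already known."""
    out = []
    for chunk in split_chunks(data):
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in known_digests:
            out.append((digest, None))   # receiver has it: send name only
        else:
            out.append((digest, chunk))  # new content: send the bytes
            known_digests.add(digest)
    return out
```

Because boundaries are chosen from the content itself, inserting a few bytes at the start of an object shifts the chunks but leaves most of their digests unchanged, which is what allows only unique chunks to be transmitted.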

Generalizing these systems, DOT [21] proposes a 
“transfer service” as an interface between applications 
and the network. Applications pass the object they want 
to send to DOT. Objects are split into chunks and the 
sender sends chunk hashes to the receiver. The receiver 
maintains a cache of earlier received chunks and re- 
quests only the chunks that were not found in its cache 
or its neighbors’ caches. Thus, DOT can leverage TBs 
of cache in the disks of an end host and its peers to eliminate
redundancy. Similarly, SET [17] exploits chunk-
level similarity in downloading related large files. DOT 
and SET use an average chunk size of 2KB or more. 
These approaches mainly benefit large transfers; the extra
round trips they incur can only be amortized over long
transfers. In contrast, EndRE identifies redundancy
across chunk sizes of 32 bytes and does not impose ad- 
ditional latency. It is also limited to main-memory based 
caches of size 1-10MB per pair of hosts (§5). Thus, EndRE
and DOT complement each other.

Protocol-independent WAN optimizers. To over- 
come the limitations of the “classical” approaches, en- 
terprises have moved increasingly toward protocol inde- 
pendent RE techniques, used in WAN optimizers. These 
WAN optimizers can be of two types, depending on 
which network layer they operate at, namely, IP layer 
devices [20, 3] or higher-layer devices [1, 5]. 

In either case, special middleboxes are deployed at ei- 
ther end of a WAN link to index all content exchanged 
across the link, and identify and remove partial redun- 
dancies on the fly. Rabin fingerprinting [18] is used to 
index content and compute overlap (similar to [20, 15]). 
Both sets of techniques are highly effective at reduc- 
ing the utilization of WAN links. However, as men- 
tioned earlier, they suffer from two key limitations, 
namely, lack of support for end-to-end encryption and 
for resource-constrained mobile devices. 


3 Motivation 


In exploring an end-point based RE service, one of the 
main issues we hope to address is whether such a ser- 
vice can offer bandwidth savings approaching that of 
WAN optimizers. To motivate the likely benefits of an 
end-point based RE service, we briefly review two key 
findings from our earlier study [8] of an IP-layer WAN 
optimizer [7]. 

First, we seek to identify the origins of redundancy. 
Specifically, we classify the contribution of redundant 
byte matches to bandwidth savings as either intra-host 
(current and matched packet in cache have identical 
source-destination IP addresses) or inter-host (current 
and matched packets differ in at least one of source or 
destination IP addresses). We were limited to a 250MB 
cache size given the large amount of meta-data neces- 
sary for this analysis, though we saw similar compres- 
sion savings for cache sizes up to 2GB. Surprisingly, 
our study revealed that over 75% of savings were from 
intra-host matches. This implies that a pure end-to-end 
solution could potentially deliver a significant share of 
the savings obtained by an IP WAN optimizer, since the 
contribution due to inter-host matches is small. How- 
ever, this finding holds good only if end systems operate 
with similar (large) cache sizes as middleboxes, which 
is impractical. This brings us to the second key finding. 


Examining the temporal characteristics of redundancy, 
we found that the redundant matches in the WAN opti- 
mizer displayed a high degree of temporal locality with 
60-80% of middlebox savings arising from matches with 
packets in the most recent 10% of the cache. This im- 
plies that small caches could capture a bulk of the sav- 
ings of a large cache. 

Taken together, these two findings suggest that an end 
point-based RE system with a small cache size can in- 
deed deliver a significant portion of the savings of a 
WAN optimizer, thus motivating the design of EndRE. 
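
The two measurements behind these findings can be sketched as follows. This is an illustrative reconstruction, not the code from the earlier study: the `Match` record, its `cache_age` field (0.0 for the newest cache entry, 1.0 for the oldest), and the function name `savings_breakdown` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Match:
    cur_flow: tuple      # (src_ip, dst_ip) of the current packet
    cached_flow: tuple   # (src_ip, dst_ip) of the matched cached packet
    bytes_saved: int     # redundant bytes removed by this match
    cache_age: float     # assumed: relative position in cache, 0.0 = newest


def savings_breakdown(matches, recent: float = 0.10):
    """Return (intra-host share, recent-cache share) of total bytes saved."""
    total = sum(m.bytes_saved for m in matches)
    if total == 0:
        return 0.0, 0.0
    # Intra-host: current and matched packets have identical src-dst pairs.
    intra = sum(m.bytes_saved for m in matches if m.cur_flow == m.cached_flow)
    # Temporal locality: match found in the most recent part of the cache.
    fresh = sum(m.bytes_saved for m in matches if m.cache_age <= recent)
    return intra / total, fresh / total
```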

Finally, note that the focus of comparison in this section
is between an IP-layer WAN optimizer with an in-memory
cache (size is O(GB)) and an end-system solution.
The first finding is not as surprising once we realize
that the in-memory cache gets recycled frequently (on 
the order of tens of minutes) during peak hours on our 
enterprise traces, limiting the possibility for inter-host 
matches. A WAN optimizer typically also has a much 
larger on-disk cache (size is O(TB)) which may see a
large fraction of inter-host matches; an end-system disk 
cache-based solution such as DOT [21] could capture 
analogous savings. 


4 Design Goals 


EndRE is designed to optimize data transfers in the di- 
rection from servers in a remote data center to clients 
in the enterprise, since this captures a majority of enter- 
prise traffic. We now list five design goals for EndRE — 
the first two design goals are shared to some extent by 
prior RE approaches, but the latter three are unique to 
EndRE. 

1. Transparent operation: For ease of deployability,
the EndRE service should require no changes to existing 
applications run within the data center or on clients. 

2. Fine-grained operation: Prior work has shown that 
many enterprise network transfers involve just a few 
packets [16]. To improve end-to-end latencies and pro- 
vide bandwidth savings for such short flows, EndRE 
must work at fine granularities, suppressing duplicate 
byte strings as small as 32-64B. This is similar to [20], 
but different from earlier proposals for file-systems [15] 
and Web caches [19] where the sizes of redundancies 
identified are 2-4KB. 

3. Simple decoding at clients: EndRE’s target client set 
includes battery- and CPU-constrained devices such as 
smart-phones. While working on fine granularities can 
help identify greater amounts of redundancy, it can also 
impose significant computation and decoding overhead, 
making the system impractical for these devices. Thus, 
a unique goal is to design algorithms that limit client 
overhead by offloading all compute-intensive actions to 
servers. 

4. Fast and adaptive encoding at servers: EndRE is 
designed to opportunistically leverage CPU resources on
end hosts when they are not being used by other appli- 
cations. Thus, unlike commercial WAN optimizers and 
prior RE approaches [20], EndRE must adapt its use of 
CPU based on server load. 

5. Limited memory footprint at servers and clients: 
EndRE relies on data caches to perform RE. However, 
memory on servers and clients could be limited and may 
be actively used by other applications. Thus, EndRE 
must use as little memory on end-hosts as possible
through the use of optimized data structures. 


5 EndRE Design 


In this section, we describe how EndRE’s design meets 
the above goals. 

EndRE introduces RE modules into the network 
stacks of clients and remote servers. Since we wish to 
be transparent to applications, EndRE could be imple- 
mented either at the IP-layer or at the socket layer (above 
TCP). As we argue in §6, we believe that the socket layer
is the right place to implement EndRE. Doing so offers
key performance benefits over an IP-layer approach, and 
more importantly, shields EndRE from network-level 
events (e.g., packet losses and reordering), making it 
simpler to implement. 

There are two sets of modules in EndRE, those be- 
longing on servers and those on clients. The server-side 
module is responsible for identifying redundancy in net- 
work data by comparing against a cache of prior data, 
and encoding the redundant data with shorter meta-data. 
The meta-data is essentially a set of <offset, length> 
tuples that are computed with respect to the client-side 
cache. The client-side module is trivially simple: it
consists of a fixed-size circular FIFO log of packets and
simple logic to decode the meta-data by “de-referencing”
the offsets sent by the server. Thus, most of the
complexity in EndRE is on the server side and we
focus on that here.

Identifying and removing redundancy is typically ac- 
complished [20, 7] by the following two steps: 

• Fingerprinting: Selecting a few “representative
regions” for the current block of data handed down by
application(s). We describe four fingerprinting algorithms
in §5.1 that differ in the trade-off they impose between
computational overhead on the server and the
effectiveness of RE.

• Matching and Encoding: Once the representative
regions are identified, we examine two approaches for
identification of redundant content in §5.2: (1)
Identifying chunks of representative regions that repeat in full
across data blocks, called Chunk-Match, and (2)
Identifying maximal matches around the representative
regions that are repeated across data blocks, called
Max-Match. These two approaches differ in the trade-off
between the memory overhead imposed on the server and
the effectiveness of RE.


Next, we describe EndRE’s design in detail, starting 
with selection of representative regions, and moving on 
to matching and encoding. 


5.1 Fingerprinting: Balancing Server 
Computation with Effectiveness 


In this section, we outline four approaches for identify- 
ing the representative payload regions at the server that 
vary in the way they trade-off between computational 
overhead and the effectiveness of RE. In some of the 
approaches, computational overhead can be adaptively 
tuned based on server CPU load, and the effectiveness 
of RE varies accordingly. Although three of the four ap- 
proaches were proposed earlier, the issue of their com- 
putational overhead has not received enough attention. 
Since this issue is paramount for EndRE, we consider it 
in great depth here. We also propose a new approach, 
SAMPLEBYTE, that combines the salient aspects of 
prior approaches. 


We first introduce some notation and terminology to 
help explain the approaches. Restating from above, a 
“data block” or simply a “block” is a certain amount of 
data handed down by an application to the EndRE mod- 
ule at the socket layer. Each data block can range from 
a few bytes to tens of kilobytes in size. 


Let w represent the size of the minimum redundant
string (contiguous bytes) that we would like to
identify. For a data block of size S bytes, S > w, a total
of S − w + 1 strings of size w are potential candidates
for finding a match. Typical values for w range from 12
to 64 bytes. Based on our findings of redundant match
length distribution in [8], we choose a default value of
w = 32 bytes to maximize the effectiveness of RE. Since
S >> w, the number of such candidate strings is on the
order of the number of bytes in the data block/cache.
Since it is impractical to match/store all possible
candidates, a fraction 1/p of them are chosen as
“representative” candidates.
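
The arithmetic above can be illustrated with a short Python sketch. The function names are ours (hypothetical) and are not part of EndRE; the sketch simply counts the S − w + 1 candidate strings and the expected number of representatives kept when a fraction 1/p is sampled.

```python
def candidate_count(block_size: int, w: int = 32) -> int:
    """Number of w-byte candidate strings in a block of block_size bytes."""
    if block_size < w:
        return 0
    return block_size - w + 1


def expected_representatives(block_size: int, w: int = 32, p: int = 32) -> float:
    """Expected number of candidates kept when a fraction 1/p is sampled."""
    return candidate_count(block_size, w) / p
```

For a 1024-byte block with w = 32 there are 993 candidates, of which roughly 31 would be kept at p = 32.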

Let us define markers as the first byte of these chosen 
candidate strings and chunks as the string of bytes be- 
tween two markers. Let fingerprints be a pseudo-random 
hash of fixed w-byte strings beginning at each marker 
and chunk-hashes be hashes of the variable sized chunks. 
Note that two fingerprints may have overlapping bytes; 
however, by definition, chunks are disjoint. The
different algorithms, depicted in Figure 1 and discussed
below, primarily vary in the manner in which they choose
the markers, from which one can derive chunks,
fingerprints, and chunk-hashes. As we discuss later in §5.2, the
Chunk-Match approach uses chunk-hashes while
Max-Match uses fingerprints.

Figure 1: Fingerprinting algorithms with chunks, markers
and fingerprints; chunk-hashes, not shown, can be
derived from chunks







1  //Let w = 32; p = 32; Assume len > w;
2  //RabinHash() computes RABIN hash over a w byte window
3  MODP(data, len)
4    for(i = 0; i < w − 1; i++)
5      fingerprint = RabinHash(data[i]);
6    for(i = w − 1; i < len; i++)
7      fingerprint = RabinHash(data[i]);
8      if (fingerprint % p == 0) //MOD
9        marker = i − w + 1;
10       store marker, fingerprint in table;


Figure 2: MODP Fingerprinting Algorithm 


5.1.1 MODP 


In the “classical” RE approaches [20, 7, 15], the set of
fingerprints is chosen by first computing a Rabin-Karp
hash [18] over sliding windows of w contiguous bytes
of the data block. A fraction 1/p of the fingerprints,
those whose value is 0 mod p, are then chosen. Choosing
fingerprints in this manner has the advantage that the set of
representative fingerprints for a block remains mostly the same
despite a small number of insertions/deletions/reorderings,
since the markers/fingerprints are chosen based on
content rather than position.


Note that two distinct operations, marker identification
and fingerprinting, are both handled by the same
hash function here. While this appears elegant, it has
a cost. Specifically, the per-block computational cost is
independent of the sampling period, p (lines 4-7 in
Figure 2). Thus, this approach cannot adapt to server CPU
load conditions (e.g., by varying p). Note that, while the
authors of [20] report some impact of p on processing
speed, this impact is attributed to the overhead of
managing meta-data (line 10). We devise techniques in §5.2
to significantly reduce the overhead of managing
meta-data, thus making fingerprint computation the main
bottleneck.
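
A minimal Python sketch of the MODP procedure in Figure 2 is given below for illustration. It is not the authors' implementation: the polynomial rolling hash (with the assumed constants BASE and MOD) is a simple stand-in for the Rabin hash, and the name modp_markers is ours. Note that the hash is updated for every byte regardless of p, which is exactly the cost discussed above.

```python
BASE = 257               # assumed multiplier for the stand-in rolling hash
MOD = (1 << 61) - 1      # assumed large prime modulus


def modp_markers(data: bytes, w: int = 32, p: int = 32):
    """Return (marker, fingerprint) pairs whose fingerprint is 0 mod p."""
    if len(data) < w:
        return []
    top = pow(BASE, w - 1, MOD)      # weight of the byte leaving the window
    fp = 0
    for b in data[:w]:               # hash of the first w-byte window
        fp = (fp * BASE + b) % MOD
    selected = []
    for start in range(len(data) - w + 1):
        if start > 0:
            # Slide by one byte: drop data[start-1], append data[start+w-1].
            fp = ((fp - data[start - 1] * top) * BASE + data[start + w - 1]) % MOD
        if fp % p == 0:              # the MOD test of Figure 2, line 8
            selected.append((start, fp))
    return selected
```

Because selection depends only on window content, prepending bytes to a block shifts the markers but does not change which windows are chosen.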










1  //Let w = 32; p = 32; Assume len > w;
2  //SAMPLETABLE[i] maps byte i to either 0 or 1
3  //JenkinsHash() computes hash over a w byte window
4  SAMPLEBYTE(data, len)
5    for(i = 0; i < len − w; i++)
6      if (SAMPLETABLE[data[i]] == 1)
7        marker = i;
8        fingerprint = JenkinsHash(data + i);
9        store marker, fingerprint in table;
10       i = i + p/2;


Figure 3: SAMPLEBYTE Fingerprinting Algorithm 


5.1.2 MAXP


Apart from the conflation of marker identification and 
fingerprinting, another shortcoming of the MODP ap- 
proach is that the fingerprints/markers are chosen based 
on a global property, i.e., fingerprints have to take
certain pre-determined values to be chosen. The markers
for a given block may be clustered and there may be 
large intervals without any markers, thus, limiting re- 
dundancy identification opportunities. To guarantee that 
an adequate number of fingerprints/markers are chosen 
uniformly from each block, markers can be chosen as 
bytes that are local-maxima over each region of p bytes 
of the data block [8]. Once the marker byte is chosen, an 
efficient hash function such as Jenkins Hash [2] can be 
used to compute the fingerprint. By increasing p, fewer 
maxima-based markers need to be identified, thereby re- 
ducing CPU overhead. 
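
The following Python sketch shows one plausible reading of local-maxima marker selection. It is an assumption on our part, not the algorithm of [8]: a position is taken as a marker when its byte value is the maximum within p/2 bytes on either side, with ties resolved in favor of the earliest position. The name maxp_markers is hypothetical, and the scan is deliberately naive (O(len × p)) for clarity.

```python
def maxp_markers(data: bytes, w: int = 32, p: int = 32):
    """Return positions whose byte is a local maximum over a p-byte neighborhood."""
    half = p // 2
    markers = []
    # Only positions that still have a full w-byte window after them qualify.
    for i in range(len(data) - w + 1):
        left = data[max(0, i - half):i]
        right = data[i + 1:i + 1 + half]
        # Strictly greater than earlier neighbors, at least equal to later ones,
        # so a run of equal maxima yields a single marker (its first byte).
        if all(b < data[i] for b in left) and all(b <= data[i] for b in right):
            markers.append(i)
    return markers
```

Since each marker excludes its neighbors within p/2 bytes, increasing p directly lowers the number of markers and therefore the number of fingerprints computed.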


5.1.3 FIXED 


While markers in both MODP and MAXP are chosen 
based on content of the data block, the computation of 
Rabin hashes and local maxima can be expensive. A 
simpler approach is to be content-agnostic and simply 
select every p-th byte as a marker. Since markers are
simply chosen by position, marker identification incurs no
computational cost. Once markers are chosen, S/p fin- 
gerprints are computed using Jenkins Hash as in MAXP. 
While this technique is very efficient, its effectiveness 
in RE is not clear as it is not robust to small changes in 
content. While prior works in file systems (e.g., [15]), 
where cache sizes are large (O(TB)), argue against this
approach, it is not clear how ineffective FIXED will be
in EndRE where cache sizes are small (O(MB)).


5.1.4 SAMPLEBYTE 


MAXP and MODP are content-based and thus robust 
to small changes in content, while FIXED is content- 
agnostic but computationally efficient. We designed 
SAMPLEBYTE (Figure 3) to combine the robustness of 
a content-based approach with the computational effi- 
ciency of FIXED. It uses a 256-entry lookup table with 
a few predefined positions set. As the data block is
scanned byte-by-byte (line 5), a byte is chosen as a
marker if the corresponding entry in the lookup table is
set (lines 6-7). Once a marker is chosen, a fingerprint is
computed using Jenkins Hash (line 8), and p/2 bytes of
content are skipped (line 10) before the process repeats.
Thus, SAMPLEBYTE is content-based, albeit based on
a single byte, while retaining the content-skipping and
the computational characteristics of FIXED.


One clear concern is whether such a naive marker
identification approach will do badly and cause the
algorithm to either over-sample or under-sample. First,
note that MODP with 32-64 byte rolling hashes was
originally used in file systems [15] where chunk sizes
were large (2-4KB). Given that we are interested in
sampling as frequently as every 32-64 bytes, sampling chunk
boundaries based on 1-byte content values is not as
radical as it might first seem. Also, note that if x entries
of the 256-entry lookup table are randomly set (where
256/x = p), then the expected sampling frequency is
indeed 1/p. In addition, SAMPLEBYTE skips p/2 bytes
after each marker selection to avoid oversampling when
the content bytes of a data block are not uniformly
distributed (e.g., when the same content byte is repeated
contiguously). Finally, while a purely random selection
of 256/x entries does indeed perform well in our traces,
we use a lookup table derived based on the heuristic
described below. This heuristic outperforms the random
approach and we have found it to be effective after
extensive testing on traces (see §8).


Since the number of unique lookup tables is large 
(2^256), we use an offline, greedy approach to generate
the lookup table. Using network traces from one of the 
enterprise sites we study as training data (site 11 in Ta- 
ble 2), we first run MAXP to identify redundant con- 
tent and then sort the characters in descending order of 
their presence in the identified redundant content. We 
then add these characters one at a time, setting the cor- 
responding entries in the lookup table to 1, and stop this 
process when we see diminishing gains in compression. 
The intuition behind this approach is that characters that 
are more likely to be part of redundant content should 
have a higher probability of being selected as markers. 
The characters selected from our training data were 0, 
32, 48, 101, 105, 115, 116, 255. While our current ap- 
proach results in a static lookup table, we are looking at 
online dynamic adaptation of the table as part of future 
work. 


Since SAMPLEBYTE skips p/2 bytes after every
marker selection, the fraction of markers chosen is at most
2/p, irrespective of the number of entries set in the table.
By increasing p, fewer markers/fingerprints are chosen,
resulting in reduced CPU overhead.
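
As an illustration, Figure 3 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the lookup table is populated with the byte values listed above, zlib.crc32 is used purely as a stand-in for Jenkins Hash, and the name samplebyte is ours.

```python
import zlib

# Byte values reported in the text as selected from the training trace.
MARKER_BYTES = (0, 32, 48, 101, 105, 115, 116, 255)
SAMPLETABLE = [1 if b in MARKER_BYTES else 0 for b in range(256)]


def samplebyte(data: bytes, w: int = 32, p: int = 32):
    """Return (marker, fingerprint) pairs chosen by single-byte sampling."""
    out = []
    i = 0
    while i < len(data) - w:
        if SAMPLETABLE[data[i]] == 1:
            # Stand-in hash over the w-byte window starting at the marker.
            fingerprint = zlib.crc32(data[i:i + w])
            out.append((i, fingerprint))
            i += p // 2          # skip p/2 bytes after a marker (line 10)
        i += 1                   # the loop increment of line 5
    return out
```

Even on degenerate input such as a block made of one repeated marker byte, the skip keeps the sampling rate below 2/p: a 100-byte block of the byte 101 yields markers only at offsets 0, 17, 34 and 51.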


Figure 4: Chunk-Match: only chunk-hashes stored


5.2 Matching and Encoding: Optimizing 
Storage and Client Computation 


Once markers and fingerprints are identified, identifica- 
tion of redundant content can be accomplished in two 
ways: (1) Identifying chunks of data that repeat in full 
across data blocks, called Chunk-Match, or (2) Identi- 
fying maximal matches around fingerprints that are re- 
peated across data blocks, called Max-Match. Both tech- 
niques were proposed in prior work: the former in the 
context of file systems [15] and Web object
compression [19], and the latter in the context of IP WAN
optimizers [20]. However, prior proposals impose significant
storage and CPU overhead. 

In what follows we describe how the overhead im- 
pacts both servers and clients, and the two techniques we 
employ to address these overheads. The first technique is 
to leverage asymmetry between servers and clients. We 
propose that clients offload most of the computationally 
intensive operations (e.g., hash computations) and mem- 
ory management tasks to the server. The second tech- 
nique is to exploit the inherent structure within the data 
maintained at servers and clients to optimize memory 
usage. 


5.2.1 Chunk-Match 


This approach (Figure 4) stores hashes of the chunks in 
a data block in a “Chunk-hash store”. Chunk-hashes 
from payloads of future data blocks are looked up in 
the Chunk-hash store to identify if one or more chunks 
have been encountered earlier. Once matching chunks 
are identified, they are replaced by meta-data. 

Although similar approaches were used in prior sys- 
tems, they impose significantly higher overhead if em- 
ployed directly in EndRE. For example, in LBFS [15], 
clients have to update their local caches with map- 
pings between new content-chunks and corresponding 
content-hashes. This requires expensive SHA-1 hash 
computation at the client. Value-based web caching [19] 
avoids the cost of hash computation at the client by hav- 
ing the server send the hash with each chunk. However, 
the client still needs to store the hashes, which is a sig- 
nificant overhead for small chunk sizes. Also, sending 
hashes over the network adds significant overhead given
that the hash sizes (20 bytes) are comparable to average 
chunk sizes in EndRE (32-64 bytes). 

EndRE optimizations: We employ two ideas to reduce
overhead on clients and servers.

(1) Our design carefully offloads all storage manage- 
ment and computation to servers. A client simply main- 
tains a fixed-size circular FIFO log of data blocks. The 
server emulates client cache behavior on a per-client ba- 
sis, and maintains within its Chunk-hash store a mapping 
of each chunk hash to the start memory addresses of the 
chunk in a client’s log along with the length of the chunk. 
For each matching chunk, the server simply encodes and 
sends a four-byte <offset, length> tuple of the chunk 
in the client’s cache. The client simply “de-references”
the offsets sent by the server and reconstructs the com- 
pressed regions from local cache. This approach avoids 
the cost of storing and computing hashes at the client, as 
well as the overhead of sending hashes over the network, 
at the cost of slightly higher processing and storage at the 
server end. 

(2) In traditional Chunk-Match approaches, the server 
maintains a log of the chunks locally. We observe that 
the server only needs to maintain an up-to-date chunk- 
hash store, but it does not need to store the chunks them- 
selves as long as the chunk hash function is collision re- 
sistant. Thus, when a server computes chunks for a new 
data block and finds that some of the chunks are not at 
the client by looking up the chunk-hash store, it inserts 
mappings between the new chunk hashes and their ex- 
pected locations in the client cache. 
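
The two optimizations can be sketched together in Python. This is a simplified illustration under our own assumptions, not the EndRE implementation: chunks are passed in already split, tokens are Python tuples rather than packed four-byte values, and the class names ClientLog and ChunkMatchServer are hypothetical. The server keeps only a chunk-hash store and a byte counter that mirrors the client's circular log; the client never hashes anything.

```python
import hashlib


class ClientLog:
    """Client side: a fixed-size circular FIFO log plus de-referencing."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.written = 0                      # total bytes ever appended

    def decode(self, tokens) -> bytes:
        out = bytearray()
        for kind, value in tokens:
            if kind == "raw":
                out += value
            else:                             # ("ref", (offset, length))
                offset, length = value
                out += bytes(self.buf[(offset + k) % self.capacity]
                             for k in range(length))
        for b in out:                         # log the reconstructed block
            self.buf[self.written % self.capacity] = b
            self.written += 1
        return bytes(out)


class ChunkMatchServer:
    """Server side: stores only chunk hashes and expected client positions."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.written = 0                      # mirrors ClientLog.written
        self.store = {}                       # SHA-1 digest -> (abs position, length)

    def encode(self, chunks):
        base = pos = self.written
        tokens, fresh = [], []
        for chunk in chunks:
            digest = hashlib.sha1(chunk).digest()
            hit = self.store.get(digest)
            # A hit is usable only if the client log has not overwritten it yet.
            if hit is not None and hit[0] >= base - self.capacity:
                tokens.append(("ref", (hit[0] % self.capacity, hit[1])))
            else:
                tokens.append(("raw", chunk))
            fresh.append((digest, (pos, len(chunk))))
            pos += len(chunk)
        self.store.update(fresh)              # expected locations in the client log
        self.written = pos
        return tokens
```

New mappings are inserted only after the whole block is encoded, so a reference never points into the block that the client is still reconstructing.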

In our implementation, we use SHA-1 to compute 160 
bit hashes, which has good collision-resistant properties. 
Let us now compute the storage requirements for Chunk- 
Match assuming a sampling period p of 64 bytes and a 
cache size of 16MB. The offset to the 16MB cache can
be encoded in 24 bits and the length encoded in 8 bits 
assuming the maximum length of a chunk is limited to 
256 bytes (recall that chunks are variable sized). Thus, 
server meta-data storage is 24 bytes per 64-byte chunk, 
comprising 4-bytes for the <offset, length> tuple and 
20-bytes for SHA-1 hash. This implies that server mem- 
ory requirement is about 38% of the client cache size. 
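
The storage estimate above can be checked with a few lines of Python. The function name and parameters are ours (hypothetical); the numbers are those given in the text.

```python
def chunk_match_overhead(cache_bytes: int = 16 * 2**20, p: int = 64,
                         hash_bytes: int = 20, length_bits: int = 8):
    """Return (meta-data bytes per chunk, fraction of client cache size)."""
    offset_bits = (cache_bytes - 1).bit_length()        # 24 bits for a 16MB cache
    tuple_bytes = (offset_bits + length_bits + 7) // 8  # <offset, length> tuple
    per_chunk = tuple_bytes + hash_bytes                # plus the SHA-1 hash
    return per_chunk, per_chunk / p
```

With the defaults this gives 24 bytes per 64-byte chunk, i.e. 37.5%, which the text rounds to 38%.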


5.2.2 Max-Match 


A drawback of Chunk-Match is that it can only detect 
exact matches in the chunks computed for a data block. 
It could miss redundancies that, for instance, span con- 
tiguous portions of neighboring chunks or redundancies 
that only span portions of chunks. An alternate ap- 
proach, called Max-Match, proposed for IP WAN op- 
timizer [20, 7] and depicted in Figure 5, can identify 
such redundancies, albeit at a higher memory cost at the 
server. 

Figure 5: Max-Match: matched region is expanded

Table 1: 1MB Fingerprint store for 16MB cache


In Max-Match, fingerprints computed for a data block 
serve as random “hooks” into the payload around which 
more redundancies can be identified. The computed
fingerprints for a data block are compared with a
“fingerprint store” that holds fingerprints of all past data blocks.
For each matching fingerprint, the corresponding match- 
ing data block is retrieved from the cache and the match 
region is expanded byte-by-byte in both directions to ob- 
tain the maximal region of redundant bytes (Figure 5). 
Matched regions are then encoded with <offset, length> 
tuples. 
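
The expansion step can be sketched as follows. This is an illustrative Python version under our own naming (expand_match is hypothetical); it assumes the fingerprint lookup has already produced a candidate position in the cache and a position in the new block.

```python
def expand_match(cache: bytes, cache_pos: int, block: bytes, block_pos: int,
                 w: int = 32):
    """Grow a w-byte fingerprint hit into the maximal matching region.

    Returns (cache_start, block_start, length), or None if the hit was a
    fingerprint collision or the window does not fit.
    """
    if cache_pos + w > len(cache) or block_pos + w > len(block):
        return None
    if cache[cache_pos:cache_pos + w] != block[block_pos:block_pos + w]:
        return None                          # collision: bytes differ
    left = 0                                 # bytes matched before the window
    while (cache_pos - left > 0 and block_pos - left > 0
           and cache[cache_pos - left - 1] == block[block_pos - left - 1]):
        left += 1
    right = w                                # bytes matched from window start
    while (cache_pos + right < len(cache) and block_pos + right < len(block)
           and cache[cache_pos + right] == block[block_pos + right]):
        right += 1
    return cache_pos - left, block_pos - left, left + right
```

The byte comparison at the start is what makes fingerprint collisions harmless: a false hit costs one extra memory lookup and is then discarded.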

EndRE optimizations: We employ two simple ideas 
to improve the server computation and storage overhead. 

First, since Max-Match relies on byte-by-byte com- 
parison to identify matches, fingerprint collisions are 
not costly; any collisions will be recovered via an ex- 
tra memory lookup. This allows us to significantly limit 
fingerprint store maintenance overhead for all four al- 
gorithms since fingerprint values are simply overwritten 
without separate bookkeeping for deletion. Further, a 
simple hash function that generates a few bytes of hash 
value as a fingerprint (e.g., Jenkins hash [2]) is sufficient. 

Second, we optimize the representation of the finger- 
print hash table to limit storage needs. Since the map- 
ping is from a fingerprint to an offset value, the finger- 
print itself need not be stored in the table, at least in its 
entirety. The index into the fingerprint table can implic- 
itly represent part of the fingerprint and only the remain- 
ing bits, if any, of the fingerprint that are not covered by 
the index can be stored in the table. In the extreme case, 
the fingerprint table is simply a contiguous set of offsets, 
indexed by the fingerprint hash value. 

Table 1 illustrates the fingerprint store for a cache size
of 16MB and p = 64. In this case, the number of
fingerprints to index the entire cache is simply 2^24/64 or 2^18.
Using a table size of 2^18 implies that 18 bits of a
fingerprint are implicitly stored as the index of the table. The
offset size necessary to represent the entire cache is 24
bits. Assuming we store an additional 8 bits of the
fingerprint as part of the table, the entire fingerprint table
can be compactly stored in a table of size 2^18 × 4 bytes,
or 6% of the cache size. A sampling period of 32 would
double this to 12% of the cache size. This leads to a
significant reduction in fingerprint meta-data size
compared to the 67% indexing overhead in [20] or the 50%
indexing overhead in [7].
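
A small Python sketch of such a compact table is shown below. It is only an illustration of the layout described above, with names of our choosing (FingerprintTable is hypothetical): the low 18 bits of a fingerprint select the slot, 8 further bits are stored as a tag, and 24 bits hold the cache offset. A Python list stands in for the flat 4-byte-per-entry array a real implementation would use.

```python
class FingerprintTable:
    """Direct-indexed fingerprint store: index bits are implicit, not stored."""

    def __init__(self, index_bits: int = 18, tag_bits: int = 8,
                 offset_bits: int = 24):
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.offset_bits = offset_bits
        self.slots = [None] * (1 << index_bits)

    def size_bytes(self) -> int:
        """Size of the equivalent packed table (tag + offset per entry)."""
        return (1 << self.index_bits) * (self.tag_bits + self.offset_bits) // 8

    def _split(self, fp: int):
        index = fp & ((1 << self.index_bits) - 1)
        tag = (fp >> self.index_bits) & ((1 << self.tag_bits) - 1)
        return index, tag

    def insert(self, fp: int, offset: int) -> None:
        index, tag = self._split(fp)
        # Older entries are simply overwritten; no deletion bookkeeping.
        self.slots[index] = (tag << self.offset_bits) | offset

    def lookup(self, fp: int):
        index, tag = self._split(fp)
        entry = self.slots[index]
        if entry is None or (entry >> self.offset_bits) != tag:
            return None
        return entry & ((1 << self.offset_bits) - 1)
```

With the defaults, size_bytes() is 2^20 bytes (1MB), which is 1/16 of a 16MB cache and matches the 6% figure quoted above.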

These two optimizations are not possible in the case of 
Chunk-Match due to the more stringent requirements on 
collision-resistance of chunk hashes. However, server 
memory requirement for Chunk-Match is only 38% 
of client cache size, which is still significantly lower 
than 106% of the cache size (cache + fingerprint store) 
needed for Max-Match. 


6 Implementation 


In this section, we discuss our implementation of En- 
dRE. We start by discussing the benefits of implement- 
ing EndRE at the socket layer above TCP. 


6.1 Performance benefits 


Bandwidth: In the socket-layer approach, RE can oper- 
ate at the size of socket writes which are typically larger 
than IP layer MTUs. While Max-Match and Chunk- 
Match do not benefit from these larger sized writes since 
they operate at a granularity of 32 bytes, the large size 
helps produce higher savings if a compression algorithm 
like GZIP is additionally applied, as evaluated in §9.1.
Latency: The socket-layer approach will result in fewer 
packets transiting between server and clients, as opposed 
to the IP layer approach which merely compresses pack- 
ets without reducing their number. This is particularly 
useful in lowering completion times for short flows, as 
evaluated in §9.2.


6.2 End-to-end benefits 


Encryption: When using socket-layer RE, payload en- 
crypted in SSL can be compressed before encryption, 
providing RE benefits to protocols such as HTTPS. IP- 
layer RE will leave SSL traffic uncompressed. 

Cache Synchronization: Recall that both Max-Match 
and Chunk-Match require caches to be synchronized be- 
tween clients and servers. One of the advantages of im- 
plementing EndRE above TCP is that TCP ensures reli- 
able in-order delivery, which can help with maintaining 
cache synchronization. However, there are still two is- 
sues that must be addressed. 

First, multiple simultaneous TCP connections may be 
operating between a client and a server, resulting in or- 
dering of data across connections not being preserved. 
To account for this, we implement a simple sequence 
number-based reordering mechanism. 

Second, TCP connections may get reset in the middle
of a transfer. Thus, packets written to the cache at
the server end may not even reach the client, leading 
to cache inconsistency. One could take a pessimistic or 
optimistic approach to maintaining consistency in this 
situation. In the pessimistic approach, writes to the 
server cache are performed only after TCP ACKs for 
corresponding segments are received at the server. The 
server needs to monitor TCP state, detect ACKs, per- 
form writes to its cache and notify the client to do the 
same. In the optimistic approach, the server writes to the 
cache but monitors TCP only for reset events. In case of 
connection reset (receipt of a TCP RST from client or a 
local TCP timeout), the server simply notifies the client 
of the last sequence number that was written to the cache 
for the corresponding TCP connection. It is then the 
client’s responsibility to detect any missing packets and 
recover these from the server. We adopt the optimistic 
approach of cache writing for two reasons: (1) Our re- 
dundancy analysis [8] indicated that there is high tem- 
poral locality of matches; a pessimistic approach over 
a high bandwidth-delay product link can negatively im- 
pact compression savings; (2) The optimistic approach 
is easier to implement since only reset events need to
be monitored rather than every TCP ACK.


6.3 Implementation


We have implemented EndRE above TCP in Windows 
Server/Vista/7. Our default fingerprinting algorithm is 
SAMPLEBYTE with a sampling period, p = 32. Our 
packet cache is a circular buffer 1-16MB in size per pair
of IP addresses. Our fingerprint store is also allocated 
a bounded memory based on the values presented ear- 
lier. We use a simple resequencing buffer with a prior- 
ity queue to handle re-ordering across multiple parallel 
TCP streams. At the client side, we maintain a fixed size 
circular cache and the decoding process simply involves 
lookups of specified data segments in the cache. 
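
The resequencing buffer mentioned above might look like the following Python sketch. The details are assumptions on our part, since the text only states that a priority queue is used: we assume the server tags each encoded block with a consecutive sequence number shared across the parallel connections, and the class name Resequencer is hypothetical.

```python
import heapq


class Resequencer:
    """Release blocks in sequence order even if they arrive out of order."""

    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.pending = []                    # min-heap of (seq, block)

    def push(self, seq: int, block: bytes):
        """Accept one block; return the blocks that are now deliverable."""
        if seq < self.next_seq:
            return []                        # stale duplicate, already delivered
        heapq.heappush(self.pending, (seq, block))
        ready = []
        while self.pending and self.pending[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return ready
```

Blocks returned by push can be applied to the cache in order, which keeps the client and server caches consistent across multiple TCP streams.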

In order to enable protocol independent RE, we trans- 
parently capture application payloads on the server side 
and TCP payloads at the client side at the TCP stream 
layer, that lies between the application layer and the TCP 
transport layer. We achieve this by implementing a ker- 
nel level filter driver based on Windows Filtering Plat- 
form (WFP) [6]. This implementation allows seamless 
integration of EndRE with all application protocols that 
use TCP, with no modification to application binaries or 
protocols. We also have a management interface that 
can be used to restrict EndRE only to specific applica- 
tions. This is achieved by predicate-based filtering in 
WFP, where predicates can be application IDs, source
and/or destination IP addresses/ports. 

Finally, we have also implemented the client-side of
EndRE on mobile smartphones running the Windows
Mobile 6 OS. However, since Windows Mobile 6 does
not support Windows Filtering Platform, we have
implemented the functionality as a user-level proxy.


Trace Name           Unique Clients   Dates (Total Days)          Size (TB)
Small Enterprise     29-39            07/28/08 - 08/08/08 (11)    0.5
Medium Enterprise    62-91            07/28/08 - 08/08/08 (11)    1.5
Large Enterprise     101-210          07/28/08 - 08/08/08 (11)    3
Large Research Lab   125              06/23/08 - 07/03/08 (11)    1

Table 2: Data trace characteristics (11 sites)


7 Evaluation approach 


We use a combination of trace-based and testbed evalu- 
ation to study EndRE. In particular, we quantify band- 
width savings and evaluate scalability aspects of EndRE 
using enterprise network traffic traces; we use a testbed 
to quantify processing speed and evaluate latency and 
energy savings. We also report results from a small pilot 
deployment (15 laptops) in our lab spanning 1 week.

Traces: Our trace-based evaluation is based on full 
packet traces collected at the WAN access link of 11 cor- 
porate enterprise locations. The key characteristics of 
our traces are shown in Table 2. We classify the enter- 
prises as small, medium or large based on the number of 
internal host IP addresses seen (less than 50, 50-100, and 
100-250, respectively) in the entire trace at each of these 
sites. While this classification is somewhat arbitrary, we 
use this division to study if the benefits depend on the 
size of an enterprise. Note that the total amount of traffic 
in each trace is approximately correlated to the number 
of host IP addresses, though there is a large amount of 
variation from day to day. Typical incoming traffic num- 
bers for small enterprises varied from 0.3-10GB/day, for 
medium enterprises from 2-12GB/day and for large
enterprises from 7-50GB/day. The access link capacities
at these sites varied from a few Mbps to several tens of 
Mbps. The total size of traffic we study (including in- 
bound/outbound traffic and headers) is about 6TB. 

Testbed: Our testbed consists of a desktop server con- 
nected to a client through a router. In wireline experi- 
ments, the router is a dual-homed PC capable of emu- 
lating links of pre-specified bandwidth and latency. In 
wireless experiments, the router is a WiFi access point. 
The server is a desktop PC running Windows Server 
2008. The client is a desktop PC running Windows Vista 
or Windows 7 in the wireline experiments, and a Sam- 
sung mobile smartphone running Windows Mobile 6 in 
the wireless experiments. 


8 Costs 


In this section, we quantify the CPU and memory costs
of our implementation of EndRE. Though our evaluation
focuses largely on Max-Match, we also provide a brief
analysis of Chunk-Match.

Figure 6: Max-Match processing speed


8.1 CPU Costs 


Micro-benchmarks: We first focus on
micro-benchmarks for different fingerprinting algorithms using
Max-Match for a cache size of 10MB between a given
client-server pair (we examine cache size issues in
detail in §8.2). Table 3 presents a profiler-based analysis
of the costs of the three key processing steps on a sin- 
gle large packet trace as measured on a 2GHz 64-bit In- 
tel Xeon processor. The fingerprinting step is responsi- 
ble for identifying the markers/fingerprints; the Inline- 
Match function is called as fingerprints are generated; 
and the Admin function is used for updating the finger- 
print store. Of these steps, only the fingerprinting step is 
distinct for the algorithms, and is also the most expen- 
sive. 

One can clearly see that fingerprinting is expensive for 
MODP and is largely independent of p. Fingerprinting 
for MAXP is also expensive but we see that as p is in- 
creased, the cost of fingerprinting comes down. In the 
case of FIXED and SAMPLEBYTE, as expected, fin- 
gerprinting cost is low, with significant reductions as p 
is increased. 

Finally, note that the optimizations detailed earlier for 
updating the fingerprint store in Max-Match result in 
low cost for the Admin function in all the algorithms. 
Since matching and fingerprinting are interleaved [20], 
the cost of fingerprinting and matching functions, and 
hence total processing speed, depend on the redundancy 
of a particular trace. We next compute the average pro- 
cessing speed for the different algorithms over a large 
set of traces. 

Trace analysis: Figure 6 plots the average processing
speed in Gbps at the server for Max-Match for different
fingerprinting algorithms, while Figure 7 plots the
average bandwidth savings. We assume a packet cache size
of 10MB. We use the 11-day traces for sites 1-10 in
Table 2.

Figure 7: Max-Match bandwidth savings

We make a number of observations from these figures.
First, the processing speed of MODP is about 0.4Gbps
and, as discussed in §5, is largely unaffected by p.
Processing speed for MAXP ranges from 0.6-1.7Gbps,
indicating that the CPU overhead can be decreased by
increasing p. As expected, FIXED delivers the highest
processing speed, ranging from 2.3-7.1Gbps, since it
incurs no cost for marker identification. Finally,
SAMPLEBYTE delivers performance close to FIXED, ranging
from 2.2-5.8Gbps, indicating that the cost of
identification based on a single byte is low. Second, examining
the compression savings, the curves for MODP, MAXP,
and SAMPLEBYTE in Figure 7 closely overlap for the
most part, with SAMPLEBYTE under-performing the
other two only when the sampling period is high (at
p = 512, it appears that the choice of markers based
on a single byte may start to lose effectiveness). On
the other hand, FIXED significantly under-performs the
other three algorithms in terms of compression savings,
though in absolute terms, the savings from FIXED are
surprisingly high.

While the above results were based on a cache size of
10MB, typical for EndRE, a server is likely to have
multiple such connections in operation simultaneously. Thus,
in practice, it is unlikely to enjoy the beneficial
CPU cache effects that the numbers above reflect.
We thus conducted experiments with large cache sizes
(1-2GB) and found that processing speed indeed falls
by about 30% for the algorithms. Taking this overhead
into account, SAMPLEBYTE provides server processing
speeds of 1.5-4Gbps. To summarize, SAMPLEBYTE
provides just enough randomization for identification
of chunk markers to deliver the compression savings of
MODP/MAXP while being inexpensive enough to deliver
processing performance, similar to FIXED, of 1.5-4Gbps.

In the case of Chunk-Match, the processing speed (not
shown) is only 0.1-0.2Gbps. This is mainly due to SHA-1
hash computation (§5.2.1) and the inability to use the
fingerprint store optimizations of Max-Match (§5.2.2).
We are examining if a cheaper hash function coupled
with an additional mechanism to detect collisions and
recover payload through retransmissions will improve
performance without impacting latency.

Figure 8: Cache size vs overall bandwidth savings
(x-axis: EndRE Cache Size; y-axis: Bandwidth Savings (%))

Client Decompression: The processing cost for decom- 
pression at the end host client is negligible since EndRE 
decoding is primarily a memory lookup in the client’s 
cache; our decompression speed is 10Gbps. We exam- 
ine the impact of this in greater detail when we evaluate 
end-system energy savings from EndRE in §9.3. 


8.2 Memory Costs 


Since EndRE requires a cache per communicating 
client-server pair, quantifying the memory costs at both 
clients and servers is critical to estimating the scalability 
of the EndRE system. In the next two sections, we an- 
swer the following two key questions: 1) what cache size 
limit do we provision for the EndRE service between a 
single client-server pair? 2) Given the cache size limit 
for one pair, what is the cumulative memory requirement 
at clients and servers? 


8.2.1 Cache Size versus Savings 


To estimate the cache size requirements of EndRE, we 
first need to understand the trade-off between cache sizes 
and bandwidth savings. For the following discussion, 
unless otherwise stated, by cache size, we refer to the 
client cache size limit for EndRE service with a given 
server. The server cache size can be estimated from 
this value depending on whether Max-Match or Chunk- 
Match is used (§5). Further, while one could provision 
different cache size limits for each client-server pair, for 
administrative simplicity, we assume that cache size lim- 
its are identical for all EndRE nodes. 

Figure 8 presents the overall bandwidth savings ver- 
sus cache size for the EndRE service using the Max- 
Match approach (averaged across all enterprise links). 
Although not shown, the trends are similar for the 
Chunk-Match approach. Based on the figure, a good op- 
erating point for EndRE is at the knee of the curve corre- 
sponding to 10MB of cache, allowing for a good trade- 
off between memory resource constraints and bandwidth 
savings. 

Figure 9: Cache size vs protocol bandwidth savings 


Figure 9 plots the bandwidth savings versus cache size 
(in log-scale for clarity) for different protocols. For this 
trace set, HTTP (port 80,8080) comprised 45% of all 
traffic, SMB (port 445) and NetBios File sharing (port 
139) together comprised 26%, LDAP (port 389) was 
about 2.5% and a large set of protocols, labeled as OTH- 
ERS, comprised 26.5%. While different protocols see 
different bandwidth savings, all protocols, except OTH- 
ERS, see savings of 20+% with LDAP seeing the highest 
savings of 56%. Note that OTHERS include several pro- 
tocols that were encrypted (HTTPS:443, Remote Desk- 
top:3389, SIP over SSL:5061, etc.). For this analysis, 
since we are estimating EndRE savings from IP-level 
packet traces whose payload is already encrypted, En- 
dRE sees 0% savings. An implementation of EndRE in 
the socket layer would likely provide higher savings for 
protocols in the OTHERS category than estimated here. 
Finally, by examining the figure, one can see the “knee- 
of-the-curve” at different values of cache size for differ- 
ent protocols (10MB for HTTP, 4MB for SMB, 500KB 
for LDAP, etc.). This also confirms that the 10MB knee 
of Figure 8 is largely due to the 10MB knee for HTTP in 
Figure 9. 


This analysis suggests that the cache limit could be 
tuned depending on the protocol(s) used between a 
client-server pair without significantly impacting over- 
all bandwidth savings. Thus, we use a 10MB cache size 
only if HTTP traffic exists between a client-server pair, 
4MB if SMB traffic exists, and a default 1MB cache size 
otherwise. Finally, while this cache size limit is derived 
based on static analysis of the traces, we are looking at 
designing dynamic cache size adaptation algorithms for 
each client-server pair as part of future work. 
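
As a rough illustration of the static rule above, the following sketch maps the set of ports observed between a client-server pair to a cache size limit. The function name, the port sets, and the treatment of port 139 as SMB-like file sharing are assumptions made for this example; they are not taken from the EndRE implementation.

```python
MB = 1024 * 1024

# Port groupings follow the protocol breakdown used for Figure 9.
HTTP_PORTS = frozenset({80, 8080})
SMB_PORTS = frozenset({445, 139})  # assumption: NetBios file sharing treated like SMB


def cache_limit_bytes(ports_seen):
    """Return a per-pair cache size limit given the ports seen on that pair."""
    ports = set(ports_seen)
    if ports & HTTP_PORTS:
        return 10 * MB  # HTTP knee of the curve
    if ports & SMB_PORTS:
        return 4 * MB   # SMB knee of the curve
    return 1 * MB       # default for all other traffic
```

HTTP is checked first so that a pair carrying both HTTP and SMB traffic receives the larger limit.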
























































Figure 10: Cache scalability. (a) Client: CDF of maximum cache size at client (MB). (b) Server: CDF of maximum cache size at server (MB). 


8.2.2 Client and Server Memory Costs 


Given the cache size limits derived in the previous sec- 
tion, we now address the critical question of EndRE scal- 
ability based on the cumulative cache needs at the client 
and server for all their connections. Using the entire set 
of network traces of ten enterprise sites (44 days, 5TB) 
in Table 2, we emulate the memory needs of EndRE with 
the above cache size limits for all clients and servers. We 
use a conservative memory page-out policy in the emu- 
lation: if there has been no traffic for over ten hours be- 
tween a client-server pair, we assume that the respective 
EndRE caches at the nodes are paged to disk. For each 
node, we then compute the maximum in-memory cache 
needed for EndRE over the entire 44 days. 
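
The emulation described above can be sketched as follows. This is a simplified illustration, not the emulator used for the paper: the event format (timestamp in seconds, peer identifier) and the `limit_for_peer` callback are hypothetical, and the sketch assumes that every resident cache is charged at its full size limit.

```python
IDLE_SECONDS = 10 * 3600  # page-out threshold: ten hours without traffic


def max_resident_cache(events, limit_for_peer):
    """Peak in-memory cache for one node.

    events: iterable of (timestamp_seconds, peer), sorted by timestamp.
    limit_for_peer: function mapping a peer to its cache size limit.
    """
    last_seen = {}  # peer -> time of most recent traffic
    peak = 0
    for ts, peer in events:
        last_seen[peer] = ts
        # Caches idle for more than ten hours are assumed paged to disk.
        for idle_peer in [p for p, t in last_seen.items() if ts - t > IDLE_SECONDS]:
            del last_seen[idle_peer]
        resident = sum(limit_for_peer(p) for p in last_seen)
        peak = max(peak, resident)
    return peak
```

Running this once per node and collecting the returned peaks would give the kind of distribution plotted as a CDF in Figure 10.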


Figure 10(a) plots the CDF of the client’s maximum 
EndRE memory needs for all (~1000) clients. We find 
that the median (99th percentile) EndRE client allocates a 
maximum cache size of 60MB (275MB) during its oper- 
ation over the entire 44-day period. We also performed 
an independent study of desktop memory availability by 
monitoring memory availability at 1 minute intervals for 
110 desktops over 1 month at one of the enterprise sites. 
Analyzing this data, we found that the 5, 50 and 90th 
percentile values of unused memory, available for use, 
at these enterprise desktops were 1994MB, 873MB, and 
245MB, respectively. This validates our hypothesis that 
desktop memory resources are typically adequately pro- 
visioned in enterprises, allowing EndRE to operate on 
clients without significant memory installation costs. 


We now examine the size of the cache needed at the 
server. First, we focus on Max-Match and study the net 
size of the cache required across all active clients at the 
server. Using the same enterprise trace as above, we plot 
the CDF of server cache size for all the servers in the 
trace in Figure 10(b). From the figure, we find that the 
maximum cache requirement is about 2GB. If it is not 
feasible to add extra memory to servers, say due to cost 
or slot limitations, the Chunk-Match approach could be 
adopted instead. This would reduce the maximum cache 
requirement by 3X (§5). 

Table 4: Percentage bandwidth savings on incoming links to 10 enterprise sites over 11 day trace 


9 Benefits 


We now characterize various benefits of EndRE. We first 
investigate WAN bandwidth savings. We then quan- 
tify latency savings of using EndRE, especially on short 
transactions typical of HTTP. Finally, we quantify en- 
ergy savings on mobile smartphones, contrasting EndRE 
with prior work on energy-aware compression [11]. 


9.1 Bandwidth Savings 


In this section, we focus on the bandwidth savings of 
different RE algorithms for each of the enterprise sites, 
and examine the gains of augmenting EndRE with GZIP 
and DOT [21]. We also present bandwidth savings of an 
IP WAN optimizer for reference. 

Table 4 compares the bandwidth savings on incom- 
ing links to ten enterprise sites for various approaches. 
This analysis is based on packet-level traces and while 
operating at packet sizes or larger buffers makes little dif- 
ference to the benefits of EndRE approaches, buffer size 
can have a significant impact on GZIP-style compres- 
sion. Thus, in order to emulate the benefits of perform- 
ing GZIP at the socket layer, we aggregate consecutive 
packet payloads for up to 10ms and use this aggregated 
buffer while evaluating the benefits of GZIP. For EndRE, 
we use cache sizes of up to 10MB. We also emulate an 
IP-layer middlebox-based WAN optimizer with a 2GB 
cache. 
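
To make the GZIP emulation concrete, the sketch below aggregates the payloads of one flow into windows of up to 10ms and compresses each aggregated buffer. It is only an illustration under stated assumptions: the packet format (timestamp, payload) is hypothetical, and Python's `zlib.compress` (DEFLATE with a zlib header) stands in for GZIP.

```python
import zlib


def aggregated_compression_savings(packets, window_seconds=0.010):
    """Fraction of bytes saved when payloads are compressed per time window.

    packets: list of (timestamp_seconds, payload_bytes) for one flow,
             sorted by timestamp.
    """
    original = 0
    compressed = 0
    buffer = bytearray()
    window_start = None
    for ts, payload in packets:
        # Close the current window once it spans more than window_seconds.
        if window_start is not None and ts - window_start > window_seconds:
            original += len(buffer)
            compressed += len(zlib.compress(bytes(buffer)))
            buffer = bytearray()
            window_start = None
        if window_start is None:
            window_start = ts
        buffer += payload
    if buffer:  # flush the final window
        original += len(buffer)
        compressed += len(zlib.compress(bytes(buffer)))
    return 1.0 - compressed / original if original else 0.0
```

Setting `window_seconds=0` approximates packet-by-packet compression, which makes it easy to compare against the aggregated case.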

We observe the following: First, performing GZIP 
in isolation on packets aggregated for up to 10ms pro- 
vides per-site savings of 13% on average. Further, there 
are site specific variations; in particular, GZIP performs 
poorly for site 1 compared to other approaches. Second, 
comparing the four fingerprinting algorithms (columns 
3-6 in Table 4), we see that MODP, MAXP, and SAM- 
PLEBYTE deliver similar average savings of 25-26% 
while FIXED under-performs. In particular, in the case 
of site 1, FIXED significantly under-performs the other 
three approaches. This again illustrates how SAMPLE- 
BYTE captures enough content-specific characteristics 
to significantly outperform FIXED. Adding GZIP com- 
pression to SAMPLEBYTE improves the average sav- 
ings to 30% (column 7). While the above numbers were 
based on Max-Match, using Chunk-Match instead re- 
duces the savings to 22% (column 8), but this may be a 
reasonable alternative if server memory is a bottleneck. 


We then examine savings when EndRE is augmented 
with DOT [21]. For this analysis, we employ a heuris- 
tic to extract object chunks from the packet traces as 
follows: we combine consecutive packets of the same 
four-tuple flow and delineate object boundaries if there 
is no packet within a time window (1s). In order to ensure 
that the DOT analysis adds redundancy not seen by En- 
dRE, we conservatively add only inter-host redundancy 
obtained by DOT to the EndRE savings. We see that 
(third column from right) DOT improves EndRE savings 
by a further 6-10%, and the per-site average bandwidth 
savings improves to 34%. For reference, a WAN opti- 
mizer with 2GB cache provides per-site savings of 39% 
and if DOT is additionally applied (where only redundancy of 
matches farther away than 2GB is counted), the 
average savings goes up by only 2%. Thus, it appears 
that half the gap between EndRE and WAN optimizer 
savings comes from inter-host redundancy and the other 
half from the larger cache used by the WAN optimizer. 
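
The object-extraction heuristic used for the DOT analysis can be sketched as follows. The packet representation (timestamp, four-tuple, payload) and the function name are assumptions made for this example; the sketch simply concatenates consecutive payloads of a flow and starts a new object whenever the flow has been silent for longer than the time window.

```python
from collections import defaultdict


def extract_objects(packets, gap_seconds=1.0):
    """Split each flow's byte stream into objects at idle gaps.

    packets: iterable of (timestamp_seconds, four_tuple, payload_bytes),
             sorted by timestamp.
    """
    last_ts = {}                      # four_tuple -> time of previous packet
    current = defaultdict(bytearray)  # four_tuple -> object being assembled
    objects = []
    for ts, flow, payload in packets:
        idle = flow in last_ts and ts - last_ts[flow] > gap_seconds
        if idle and current[flow]:
            # No packet within the window: close the previous object.
            objects.append(bytes(current[flow]))
            current[flow] = bytearray()
        current[flow] += payload
        last_ts[flow] = ts
    # Emit whatever is still being assembled at the end of the trace.
    objects.extend(bytes(buf) for buf in current.values() if buf)
    return objects
```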


Summarizing, EndRE using the Max-Match approach 
with the SAMPLEBYTE algorithm provides average 
per-site savings of 26% and delivers two-thirds of the 
savings of an IP-layer WAN optimizer. When DOT is 
applied in conjunction, the average savings of EndRE 
increase to 34% and can be seen to be approaching the 
41% savings of the WAN optimizer with DOT. 


Pilot Deployment: We now report results from a small 
scale deployment. EndRE was deployed on 15 desk- 
top/laptop clients (11 users) and one server for a period 
of about 1 week (09/25/09 to 10/02/09) in our lab. We 
also hosted an HTTP proxy at the EndRE server and users 
manually enabled/disabled the use of this proxy, at any 
given time, using client-based software. During this 
period, a total of 1.7GB of HTTP traffic was delivered 
through the EndRE service with an average compression 
of 31.2%. A total of 159K TCP connections were ser- 
viced with 72 peak active simultaneous TCP connections 
and peak throughput of 18.4Mbps (WAN link was the 
bottleneck). The CPU utilization at the server remained 
within 10% including proxy processing. The number of 
packet re-orderings was less than 1% even in the pres- 
ence of multiple simultaneous TCP connections between 
client and server. We also saw a large number of TCP 
RSTs but, as reported in [10], these were mostly in lieu 
of TCP FINs and thus do not contribute to cache syn- 
chronization issues. Summarizing, even though this is 
a small deployment, the overall savings numbers match 
well with our analysis results and the ease of deployment 
validates the choice of implementing EndRE over TCP. 

Table 5: HTTP latency gain for different RTTs 


9.2 Latency Savings 


In this section, we evaluate the latency gains from de- 
ploying EndRE. In general, latency gains are possible 
for a number of reasons. The obvious case is due to 
reduction of load on the bottleneck WAN access link 
of an enterprise. Latency gains may also arise from 
the choice of implementing EndRE at the socket layer 
above TCP. Performing RE above the TCP layer helps 
reduce the amount of data transferred and thus the num- 
ber of TCP round-trips necessary for connection com- 
pletion. In the case of large file transfers, since TCP 
would mostly be operating in the steady-state conges- 
tion avoidance phase, the percentage reduction in data 
transfer size translates into a commensurate reduction in 
file download latency. Thus, for large file transfers, say, 
using SMB or HTTP, one would expect latency gains 
similar to the average bandwidth gains seen earlier. 

Latency gains in the case of short data transfers, typ- 
ical of HTTP, are harder to estimate. This is because 
TCP would mostly be operating in slow-start phase and 
a given reduction in data transfer size could translate into 
a reduction of zero or more round trips depending on 
many factors including original data size and whether or 
not the reduction occurs uniformly over the data. 
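
The following sketch illustrates why a given size reduction may save zero or more round trips in slow start. It is an idealized model, not a TCP implementation: it assumes an initial congestion window of 2 segments, a 1460-byte MSS, a window that doubles every round trip, and no losses or delayed ACKs. All of these parameters are assumptions chosen for the example.

```python
def slow_start_rounds(num_bytes, mss=1460, init_cwnd=2):
    """Round trips needed to send num_bytes if TCP stays in slow start."""
    segments = -(-num_bytes // mss)  # ceiling division
    rounds = 0
    cwnd = init_cwnd
    sent = 0
    while sent < segments:
        sent += cwnd   # one window of segments per round trip
        cwnd *= 2      # idealized slow start: window doubles each round
        rounds += 1
    return rounds


def rounds_saved(num_bytes, reduction):
    """Round trips saved when the transfer shrinks by the given fraction."""
    reduced = int(num_bytes * (1.0 - reduction))
    return slow_start_rounds(num_bytes) - slow_start_rounds(reduced)
```

Under these assumptions a 20,000-byte response needs three round trips both before and after a 30% reduction, whereas a 10,000-byte response drops from three round trips to two.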

In order to quantify latency gains for short file trans- 
fers, we perform the following experiment. From the 
enterprise network traces, we extract HTTP traffic that 
we then categorize into a series of session files. Each 
session file consists of a set of timestamped operations 
starting with a connect, followed by a series of sends and 
receives (i.e., transactions), and finally a close. 

The session files are then replayed on a testbed con- 
sisting of a client and a server connected by a PC-based 
router emulating a high bandwidth, long latency link, us- 
ing the mechanism described in [13]. During the replay, 
strict timing is enforced at the start of each session based 
on the original trace; in the case of transactions, timing 
between the start of one transaction and the start of the 
next transaction is preserved as far as possible. The per- 
formance metric of interest is latency gain which is de- 
fined as the ratio of reduction in transaction time due to 
EndRE to transaction time without EndRE. 

Table 5 shows the latency gain for HTTP for various 
transactions sorted by the number of round-trips in the 
original trace. For this trace, only 40% of HTTP trans- 
actions involved more than one round trip. For these 
transactions, latency gains on average ranged from 20% 
to 35%. These gains are comparable with the average 
bandwidth savings due to EndRE for this trace (~30%), 
demonstrating that even short HTTP transactions see la- 
tency benefits due to RE. 


9.3 Energy Savings 


We study the energy and bandwidth savings achieved us- 
ing EndRE on Windows Mobile smartphones and com- 
pare it against both no compression as well as prior work 
on energy-aware compression [11]. In [11], the authors 
evaluate different compression algorithms and show that 
ZLIB performs best in terms of energy savings on re- 
source constrained devices for decompression. We eval- 
uate the energy and bandwidth gains using two trace 
files. Traces A and B are 20MB and 15MB in size, 
respectively, and are based on enterprise HTTP traffic, 
with trace B being more compressible than trace A. 

We first micro-benchmark the computational cost of 
decompression for ZLIB and EndRE. To do this, we 
load pre-compressed chunks of the traces in the mobile 
smartphone’s memory and turn off WiFi. We then re- 
peatedly decompress these chunks and quantify the en- 
ergy cost. Figures 11(a) and (b) plot the average com- 
pression savings and energy cost of in-memory decom- 
pression for various chunk sizes, respectively. The en- 
ergy measurements are obtained using a hardware-based 
battery power monitor [4] that is accurate to within 1mA. 
From these figures, we make two observations. First, 
as the chunk size is increased, ZLIB compression sav- 
ings increase and the energy cost of decompression de- 
creases. This implies that ZLIB is energy efficient when 
compressing large chunks/files. Second, the compres- 
sion savings and energy costs of EndRE, as expected, 
are independent of chunk size. More importantly, En- 
dRE delivers compression savings comparable to ZLIB 
while incurring an energy cost of 30-60% of ZLIB. 

Table 6: Energy savings on a mobile smartphone 

We now compare the performance of ZLIB and En- 
dRE to the case of no compression by replaying the 
traces over WiFi to the mobile smartphone and perform- 
ing in-memory decompression on the phone. In the case 
of ZLIB, we consider two cases: packet-by-packet com- 
pression and bulk compression where 32KB blocks of 
data are compressed at a time, the latter representing a 
bulk download case. After decompression, each packet 
is consumed in memory and not written to disk; this 
allows us to isolate the energy cost of communication. 
If the decompressed packet is written to disk or further 
computation is performed on the packet, the total energy 
consumed for all the scenarios will be correspondingly 
higher. 

Table 6 shows energy and compression gains of using 
ZLIB and EndRE as compared to using no compression. 
We see that when ZLIB is applied on a packet-by-packet 
basis, even though it saves bandwidth, it results in in- 
creased energy consumption (negative energy savings). 
This is due to the computational overhead of ZLIB de- 
compression. On the other hand, for larger chunk sizes, 
the higher compression savings coupled with lower com- 
putational overhead (Figure 11) result in good energy 
savings for ZLIB. In the case of EndRE, we find that 
the bandwidth savings directly translate into comparable 
energy savings for communication. This suggests that 
EndRE is a more energy-efficient solution for packet- 
by-packet compression while ZLIB, or EndRE coupled 
with ZLIB, work well for bulk compression. 


10 Conclusion 


Using extensive traces of enterprise network traffic and 
testbed experiments, we show that our end-host based 
redundancy elimination service, EndRE, provides aver- 
age bandwidth gains of 26% and, in conjunction with 
DOT, the savings approach those provided by a WAN op- 
timizer. Further, EndRE achieves speeds of 1.5-4Gbps, 
provides latency savings of up to 30% and translates 
bandwidth savings into comparable energy savings on 
mobile smartphones. In order to achieve these benefits, 
EndRE utilizes memory and CPU resources of end sys- 
tems. For enterprise clients, we show that the median mem- 
ory requirement for EndRE is only 60MB. At the server 
end, we design mechanisms for working with reduced 
memory and adapting to CPU load. 

Thus, we have shown that the cleaner semantics of 
end-to-end redundancy removal can come with consider- 
able performance benefits and low additional costs. This 
makes EndRE a compelling alternative to middlebox- 
based approaches. 

Abstract 


We show how to build cheap and large CAMs, or 
CLAMs, using a combination of DRAM and flash mem- 
ory. These are targeted at emerging data-intensive net- 
worked systems that require massive hash tables running 
into a hundred GB or more, with items being inserted, 
updated and looked up at a rapid rate. For such systems, 
using DRAM to maintain hash tables is quite expen- 
sive, while on-disk approaches are too slow. In contrast, 
CLAMs cost nearly the same as using existing on-disk 
approaches but offer orders of magnitude better perfor- 
mance. Our design leverages an efficient flash-oriented 
data-structure called BufferHash that significantly lowers 
the amortized cost of random hash insertions and updates 
on flash. BufferHash also supports flexible CLAM evic- 
tion policies. We prototype CLAMs using SSDs from 
two different vendors. We find that they can offer aver- 
age insert and lookup latencies of 0.006ms and 0.06ms 
(for a 40% lookup success rate), respectively. We show 
that using our CLAM prototype significantly improves 
the speed and effectiveness of WAN optimizers. 


1 Introduction 


In recent years, a number of data-intensive networked 
systems have emerged where there is a need to maintain 
hash tables as large as tens to a few hundred gigabytes in 
size. Consider WAN optimizers [1, 2, 7, 8], for example, 
that maintain “data fingerprints” to aid in identifying and 
eliminating duplicate content. The fingerprints are 32- 
64b hashes computed over ~4-8KB chunks of content. 
The net size of content is ~10TB stored on disk [9]. Thus 
the hash table storing the mapping from fingerprints to 
on-disk addresses of data chunks could be >32GB. Just 
storing the fingerprints requires ~16GB. 

The hash tables in these content-based systems are 
also inserted into, looked up and updated frequently. For 
instance, a WAN optimizer connected to a 0.5Gbps link 
may require roughly 10,000 fingerprint lookups, inser- 
tions and updates each per second. Other examples of 
systems that employ similar large hash tables include 
data deduplication systems [4, 45], online backup ser- 
vices [5], and directory services in data-oriented network 
architectures [32, 37, 42]. These systems are becoming 
increasingly popular and being widely adopted [3]. 
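
As a back-of-the-envelope check on the operation rate quoted above, the sketch below divides the link rate by the chunk size to estimate how many fingerprints are produced per second. The function name and the convention that 1KB is 1024 bytes are assumptions for this example.

```python
def fingerprints_per_second(link_bits_per_second, chunk_bytes):
    """Chunks (and hence fingerprints) produced per second on a saturated link."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / chunk_bytes


LINK_BPS = 0.5e9  # the 0.5Gbps link from the example above

rate_small_chunks = fingerprints_per_second(LINK_BPS, 4 * 1024)
rate_large_chunks = fingerprints_per_second(LINK_BPS, 8 * 1024)
```

With 4-8KB chunks this gives roughly 7,600 to 15,300 fingerprints per second, which brackets the figure of roughly 10,000 quoted above.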

This paper arises from the quest to design effective 
hash tables in these systems. The key requirement is that 
the mechanisms used be cost-effective for the function- 
ality they support. That is, the mechanisms should of- 
fer a high number of hash table operations (>10^4) per 
second while keeping the overall cost low. We refer to 
mechanisms that satisfy these requirements as CLAMs, 
for cheap and large CAMs. 

There are two possible approaches today for support- 
ing the aforementioned systems. The first is to maintain 
hash tables in DRAM which can offer very low latencies 
for hash table operations. However, provisioning large 
amounts of DRAM is expensive. For instance, a 128GB 
RamSan DRAM-SSD offers 300K random I/Os per sec- 
ond, but it has a very high cost of ownership, includ- 
ing the device cost of $120K and an energy footprint of 
650W [20]. In other words, it can support fewer than 2.5 
hash table operations per second per dollar. 

A much cheaper alternative is to store the hash ta- 
bles on magnetic disks using database indexes, such as 
Berkeley-DB (or BDB) [6]. However, poor throughput 
of random inserts, lookups, and updates in BDB can 
severely undermine the effectiveness of the aforemen- 
tioned systems and force them to run at low speeds. For 
example, a BDB-based WAN optimizer is effective for 
link speeds of only up to 10Mbps (§8). Note that exist- 
ing fast stream databases [22, 11, 14] and wire-speed data 
collection systems [24, 29] are not suitable as CLAMs as 
they do not include any archiving and indexing schemes. 

In this paper we design and evaluate an approach that 
is 1-2 orders of magnitude better in terms of hash ope- 
rations/sec/$ compared to both disk-based and DRAM- 
based approaches. Our approach uses a commodity 
two-level storage/memory hierarchy consisting of some 
DRAM and a much larger amount of flash storage (could 
be flash memory chips or solid state disks (SSDs)). Our 
design consumes most of the I/Os in the DRAM, giv- 
ing low latency and high throughput I/Os compared to a 
flash-only design. On the other hand, using flash allows 
us to support a large hash table in a cheaper way than 
DRAM-only solutions. We choose flash over magnetic 
disks for its many superior properties, such as, higher I/O 
per second per dollar and greater reliability, as well as far 
superior power efficiency compared to both DRAM and 
magnetic disks. Newer generations of SSDs are rapidly 
getting bigger and cheaper. Configuring our design with 
4GB of memory and 80GB of flash, for instance, costs as 
little as $400 using current hardware. 

Despite flash’s attractive I/O properties, building a 
CLAM using flash is challenging. In particular, since 
the available DRAM is limited, a large part of the hash 
table must be stored in flash (unlike recent works, e.g., 
FAWN [13], where the hash index is fully in DRAM). 
Thus, hash insertion would require random I/Os, which 
are expensive on flash. Moreover, the granularity of a 
flash I/O is orders of magnitude bigger than that of an 
individual hash table operation in the systems we target. 
Thus, unless designed carefully, the CLAM could per- 
form worse than a traditional disk-based approach. 

To address these challenges, we introduce a novel data 
structure, called BufferHash. BufferHash represents a 
careful synthesis of prior ideas along with a few novel 
algorithms. A key idea behind BufferHash is that instead 
of performing individual random insertions directly on 
flash, DRAM can be used to buffer multiple insertions 
and writes to flash can happen in a batch. This shares the 
cost of a flash I/O operation across multiple hash table 
operations, resulting in a better amortized cost per op- 
eration. Like a log-structured file system [39], batches 
are written to flash sequentially, the most efficient write 
pattern for flash. The idea of batching operations to 
amortize I/O costs has been used before in many sys- 
tems [15, 28]. However using it for hash tables is novel, 
and it poses several challenges for flash storage. 

Fast lookup: With batched writes, a given (key, value) 
pair may reside in any prior batch, depending on the 
time it was written out to flash. A naive lookup al- 
gorithm would examine all batches for the key, which 
would incur high and potentially unacceptable flash I/O 
costs. To reduce the overhead of examining on-flash 
batches, BufferHash (i) partitions the key space to limit 
the lookup to one partition, instead of the entire flash 
(similar to how FAWN spreads lookups across multiple 
“wimpy nodes” [13]), and (ii) uses in-memory Bloom 
filters (as Hyperion [23] does) to efficiently determine a 
small set of batches that may contain the key. 

Limited flash: In many of the streaming applications 
mentioned earlier, insertion of new (key, value) entries 
into the CLAM requires creating space by evicting old 
keys. BufferHash uses a novel age-based internal or- 
ganization that naturally supports bulk evictions of old 
entries in an I/O-efficient manner. BufferHash also sup- 
ports other flexible eviction policies (e.g. priority-based 
removal) to match different application needs, albeit at 
additional performance cost. Existing proposals for in- 
dexing archived streaming data ignore eviction entirely. 

Updates: Since flash does not support efficient update or 
deletion, modifying existing (key, value) mappings in 
situ is expensive. To support good update latencies, we 
adopt a lazy update approach where all value mappings, 
including deleted or updated ones, are temporarily left 
on flash and later deleted in batch during eviction. Such 
lazy updates have been previously used in other contexts, 
such as buffer-trees [15] and lazy garbage collection in 
log-structured file systems [39]. 

Performance tuning: The unique I/O properties of flash 
demand careful choice of various parameters in our de- 
sign of CLAMs, such as the amount of DRAM to use, 
and the sizes of batches and Bloom filters. Suboptimal 
choice of these parameters may result in poor overall 
CLAM performance. We model the impact of these pa- 
rameters on latencies of different hash table operations 
and show how to select the optimal settings. 
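
The ideas outlined above (buffering insertions in DRAM, writing batches out in order, guarding each batch with an in-memory Bloom filter, partitioning the key space, evicting the oldest batch, and handling updates lazily) can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not the BufferHash implementation: Python dictionaries stand in for on-flash batches, all class names and parameters are hypothetical, and keys are assumed to be `bytes`.

```python
import hashlib
from collections import deque


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class BufferedPartition:
    """One key-space partition: a DRAM buffer plus flushed batches."""

    def __init__(self, buffer_capacity=4, max_batches=2):
        self.buffer = {}
        self.buffer_capacity = buffer_capacity
        self.max_batches = max_batches
        self.batches = deque()  # newest first; each is (BloomFilter, dict)
        self.flash_reads = 0    # counts simulated reads of flushed batches

    def insert(self, key, value):
        self.buffer[key] = value  # updates are just newer insertions
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.buffer:
            bloom.add(key)
        self.batches.appendleft((bloom, self.buffer))  # one batched write
        self.buffer = {}
        while len(self.batches) > self.max_batches:
            self.batches.pop()  # age-based eviction of the oldest batch

    def lookup(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for bloom, batch in self.batches:  # newest first: latest value wins
            if bloom.might_contain(key):
                self.flash_reads += 1
                if key in batch:
                    return batch[key]
        return None


class BufferHashSketch:
    def __init__(self, num_partitions=4, **partition_args):
        self.partitions = [BufferedPartition(**partition_args)
                           for _ in range(num_partitions)]

    def _partition(self, key):
        return self.partitions[hashlib.sha256(key).digest()[0] % len(self.partitions)]

    def insert(self, key, value):
        self._partition(key).insert(key, value)

    def lookup(self, key):
        return self._partition(key).lookup(key)
```

In this model, stale values are never rewritten in place; they simply disappear when the batch holding them is evicted, which mirrors the lazy update approach described above.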

We build CLAM prototypes using SSDs from two ven- 
dors. Using extensive analysis based on a variety of 
workloads, we study the latencies supported in each case 
and compare the CLAMs against popular approaches 
such as using Berkeley-DB (BDB) on disk. In particu- 
lar, we find that our Intel SSD-based CLAM offers an 
average insert latency of 0.006ms compared to 7ms from 
using BDB on disk. For a workload with 40% hit rate, 
the average lookup latency is 0.06ms for this CLAM, 
but 7ms for BDB. Thus, our CLAM design can yield 
42 lookups/sec/$ and 420 insertions/sec/$ which is 1-2 
orders of magnitude better than RamSan DRAM-SSD 
(2.5 hash operations/sec/$). The superior energy effi- 
ciency of flash and rapidly declining prices compared to 
DRAM [21] mean that the gap between our CLAM de- 
sign and DRAM-based solutions is greater than indicated 
in our evaluation and likely to widen further. Finally, us- 
ing real traces, we study the benefits of employing the 
CLAM prototypes in WAN optimizers. Using a CLAM, 
the speed of a WAN optimizer can be improved > 10X 
compared to using BDB (a common choice today [2]). 
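
The operations/sec/$ figures above can be reproduced approximately from the reported latencies. The sketch below assumes that operations are issued one at a time (so throughput is the inverse of average latency) and that the CLAM costs $400, the configuration price quoted earlier; both are assumptions about how the figures were derived.

```python
def ops_per_sec_per_dollar(latency_ms, cost_dollars):
    """Throughput per dollar, assuming one operation at a time."""
    ops_per_sec = 1000.0 / latency_ms
    return ops_per_sec / cost_dollars


CLAM_COST = 400.0  # assumed: 4GB DRAM + 80GB flash configuration

clam_lookups = ops_per_sec_per_dollar(0.06, CLAM_COST)   # about 42
clam_inserts = ops_per_sec_per_dollar(0.006, CLAM_COST)  # about 417
ramsan = 300_000 / 120_000                               # 2.5 (300K IOPS, $120K)
```

The insertion figure comes out slightly below the 420 quoted in the text, which suggests the published numbers were rounded.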

Our CLAM design marks a key step in building 
fast and effective indexing support for high-performance 
content-based networked systems. We do not claim 
that our design is final. We speculate that there may 
be smarter data structures and algorithms, that perhaps 
leverage newer memory technologies (e.g. Phase Change 
Memory), offering much higher hash operations/sec/$. 


2 Related Work 


In this section, we describe prior work on designing data 
structures for flash and recent proposals for supporting 
data-intensive streaming networked systems. 

Data structures for flash: Recent work has shown 
how to design efficient data structures on flash mem- 
ory. Examples include MicroHash [44], a hash table and 
FlashDB [36], a B-Tree index. Unlike BufferHash, these 
data structures are designed for memory-constrained em- 
bedded devices where the design goal is to optimize en- 
ergy usage and minimize memory footprint—latency is 
typically not a design goal. For example, a lookup oper- 
ation in MicroHash may need to follow multiple pointers 
to locate the desired key in a chain of flash blocks and can 
be very slow. Other recent works on designing efficient 
codes for flash memory to increase its effective capac- 
ity [30, 27] are orthogonal to our work, and BufferHash 
can be implemented on top of these codes. 

A flash-based key-value store: Closely related to our 
design of CLAMs is the recent FAWN proposal [13]. 
FAWN-KV is a clustered key-value storage built on a 
large number of tiny nodes that each use embedded 
processors and small amounts of flash memory. There 
are crucial differences between our CLAM design and 
FAWN. First, FAWN assumes that each wimpy node can 
keep its hash index in DRAM. In contrast, our design tar- 
gets situations where the actual hash index is bigger than 
available DRAM and hence part of the index needs to be 
stored in flash. In this sense, our design is complemen- 
tary to FAWN; if the hash index in each wimpy node gets 
bigger than its DRAM, it can use BufferHash to organize 
the index. Second, being a cluster-based solution, FAWN 
optimizes for throughput, not for latency. As the evalu- 
ation of FAWN shows, some of the lookups can be very 
slow (> 500ms). In contrast, our design provides better 
worst-case latency (< 1ms), which is crucial for systems 
such as WAN optimizers. Finally, FAWN-KV does not 
focus on efficient eviction of indexed data. 

Along similar lines is HashCache [16], a cache that 
can run on cheap commodity laptops. It uses an in- 
memory index for objects stored on disk. Our approach 
is complementary to HashCache, just as it is with FAWN. 

DRAM-only solutions: DRAM-SSDs provide ex- 
tremely fast I/Os, at the cost of high device cost and 
energy footprint. For example, a 128GB RamSan de- 
vice can support 300K IOPS, but costs $120K and con- 
sumes 650W [20]. A cheaper alternative from Vio- 
lin memory supports 200K IOPS, but still costs around 
$50K [40]. Our CLAM prototypes significantly outper- 
form traditional hash tables designed in these DRAM- 
SSDs in terms of operations/s/$. 

Large scale streaming systems: Hyperion [23] en- 
ables archival, indexing, and on-line retrieval of high- 
volume data streams. However, Hyperion does not suit 
the applications we target as it does not offer CAM-like 
functionality. For example, to look up a key, Hyperion 
may need to examine a prohibitively high volume of data 
resulting in a high latency. Second, it does not consider 
using flash storage, and hence does not aim to optimize 
design parameters for flash. Finally, it does not support 
efficient update or eviction of indexed data. 

Existing data stream systems [11, 14, 22] do not sup- 
port queries over archived data. StreamBase [41] sup- 
ports archiving data and processing queries over past 
data; but the data is archived in conventional hash or 
B-Tree-indexed tables, both of which are slow and are 
suitable only for offline queries. Endace DAG [24] 
and CoMo [29] are designed for wire-speed data collec- 
tion and archiving; but they provide no query interface. 
Existing DBMSs can support CAM-like functionalities. 
However, they are designed neither for high update and
lookup rates (see [14]) nor for flash storage (see [36]).

3 Motivating Applications 

In this section, we describe three networked systems that 
could benefit from effective mechanisms for building and 
maintaining large hash tables that can be written to and 
looked up at a very fast rate. 

WAN optimization. WAN optimizers [1, 2, 8, 7] are 
used by enterprises and data centers to improve network 
utilization by looking for and suppressing redundant in- 
formation in network transfers. A WAN optimizer com- 
putes fingerprints of each arriving data object and looks 
them up in a hash table of fingerprints found in prior con- 
tent. The fingerprints are 32-64b hashes computed over 
~4-8KB data chunks. Upon finding a match, the cor- 
responding duplicate content is removed, and the “com- 
pressed object” is transmitted to the destination, where 
it gets reconstructed. Fingerprints for the original object 
are inserted into the index to aid in future matches. The 
content is typically >10TB in net size [10]. Thus the 
fingerprint hash tables could be >32GB. 

Consider a WAN optimizer connected to a heavily loaded
500Mbps link. Roughly 10,000 content fingerprints are
created per second. Depending on the implementation,
three scenarios may arise during hash insertion and
lookup: (1) lookups for upcoming objects are held up
until prior inserts complete, or (2) upcoming objects
are transmitted without fingerprinting and lookup, or
(3) insertions are aborted midway and upcoming objects
are looked up against an “incomplete index.” Fast
support for insertions and lookups can improve all three
situations and help identify more content redundancy.
In §8, we show that a BDB-based WAN optimizer can
function effectively only at low speeds (10Mbps) due to
BDB’s poor support for random insertions and lookups,
even if BDB is maintained on an Intel SSD. A CLAM-based
WAN optimizer using a low-end Transcend SSD that is an
order of magnitude slower than an Intel SSD is highly
effective even at 200-300Mbps.
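
The rate of roughly 10,000 fingerprints per second quoted above for a
500Mbps link can be sanity-checked by dividing the link rate by the
chunk size. The sketch below is only an illustration; the function name
is hypothetical, and it assumes a fully saturated link and one
fingerprint per chunk.

```python
def fingerprints_per_second(link_bps: float, chunk_bytes: float) -> float:
    """Chunks (and hence fingerprints) produced per second on a saturated link."""
    return link_bps / 8 / chunk_bytes


# 500 Mbps link with the 4KB and 8KB chunk sizes mentioned in the text.
low = fingerprints_per_second(500e6, 8 * 1024)   # about 7,600 per second
high = fingerprints_per_second(500e6, 4 * 1024)  # about 15,300 per second
```

The two bounds bracket the figure of 10,000 per second used in the text.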

Data deduplication and backup. Data deduplication [4]
is the process of suppressing duplicate content from
enterprise data, leaving only one copy of the data to
be stored for archival. Prior work suggests that data
sets in de-dup systems could be roughly 8-10TB and
employ 20GB indexes [4, 45].

A time-consuming activity in deduplication is merg- 
ing data sets and the corresponding indexes. To merge 
a smaller index into a larger one, fingerprints from the 
latter dataset need to be looked up, and the larger in- 
dex updated with any new information. We estimate that 
merging fingerprints into a larger index using Berkeley- 
DB could take as long as 2hrs. In contrast, our CLAM 
prototypes can help the merge finish in under 2mins. We
note that a similar set of challenges arises in online
backup services [5], which allow users to constantly,
and in an online fashion, update a central repository
with “diffs” of the files they are editing, and to
retrieve changes from any remote location on demand.

Central directory for a data-oriented network. Re- 
cent proposals argue for a new resolution infrastruc- 
ture to dereference content names directly to host loca- 
tions [32, 37, 42]. The names are hashes computed over 
chunks of content inside data objects. As new sources of 
data arise or as old sources leave the network, the reso- 
lution infrastructure should be updated accordingly. To 
support scalability, the architectures have conventionally 
relied on a distributed resolution mechanism based on 
DHTs [32, 37, 42]. However, in some deployment sce- 
narios (e.g. a large corporation), the resolution may have 
to be provided by a trusted central entity. To ensure high 
availability and throughput for a large user-base, the cen- 
tralized deployment should support fast inserts and effi- 
cient lookups of the mappings. The CLAMs we design 
can support such an architecture effectively. 


4 Flash Storage and Hash Tables 


Flash provides a non-volatile memory store with several 
significant benefits over typical magnetic hard disks such 
as fast random reads (< 1 ms), power-efficient I/Os (<1
Watt), better shock resistance, etc. [33, 36]. However, 
because of the unique characteristics of flash storage, ap- 
plications designed for flash should follow a few well- 
known design principles: (P1) Applications should avoid 
random writes, in-place updates, and sub-block deletions 
as they are significantly more expensive on flash. For
example, updating a single 2KB page in place requires
first erasing an entire erase block (128KB-256KB) of
pages, and then writing the modified block in its
entirety. As shown
in [35], such operations are over two orders of magni- 
tude slower than sequential writes, out-of-place updates, 
and block deletions respectively, on both flash chips and 
SSDs. (P2) Since reads and writes happen at the granu- 
larity of a flash page (or an SSD sector), an I/O of size 
smaller than a flash page (2KB) costs at least as much 
as a full-page I/O. Thus, applications should avoid small 
I/Os if possible. (P3) The high fixed initialization cost 
of an I/O can be amortized with a large I/O size [12]. 
Thus, applications should batch I/Os whenever possible. 
In designing flash-based CLAMs using BufferHash, we 
follow these design principles. 


A conventional hash table on flash. Before going into 
the details of our BufferHash design, it might be useful 
to see why a conventional hash table on flash is likely to 
suffer from poor performance. Successive keys inserted 
into a hash table are likely to hash to random locations in 
the hash table; therefore, values written to those hashed 
locations will result in random writes, violating the de- 
sign principle P1 above. 

Updates and deletions are immediately applied to a
conventional hash table, resulting in in-place updates and
sub-block deletions (since each hashed value is typically 
much smaller than a flash block), and violation of P1. 

Since each hashed value is much smaller than a flash
page (or an SSD sector), inserting a single key in an
in-flash hash table violates principles P2 and P3.
Violating these principles results in poor performance
for a conventional hash table on flash, as we
demonstrate in §7.

One can try to improve the performance by buffering a
part of the hash table in DRAM and keeping the remainder
in flash. However, since hash operations exhibit
negligible locality, such a flat partitioning yields very
little performance improvement. Recent research has
confirmed that a memory buffer is practically useless for
external hashing for a read-write mixed workload [43].


5 The BufferHash Data Structure 


BufferHash is a flash-friendly data structure that supports 
hash table-like operations on (key, value) pairs¹. The
key idea underlying BufferHash is that instead of per- 
forming individual insertions/deletions one at a time to 
the hash table on flash, we can perform multiple opera- 
tions all at once. This way, the cost of a flash I/O oper- 
ation can be shared among multiple insertions, resulting 
in a better amortized cost for each operation (similar to 
buffer trees [15] and group commits in DBMS and file 
systems [28]). For simplicity, we consider only insertion 
and lookup operations for now; we will discuss updates 
and deletions later. 

To allow multiple insertions to be performed all at 
once, BufferHash operates in a lazy batched manner: it 
accumulates insertions in small in-memory hash tables 
(called buffers), without actually performing the inser- 
tions on flash. When a buffer fills up, all inserted items 
are pushed into flash in a batch. For I/O efficiency, items 
pushed from a buffer to flash are sequentially written as 
a new hash table, instead of performing expensive up- 
date to existing in-flash hash tables. Thus, at any point 
of time, the flash contains a large number of small hash 
tables. During lookup, a set of Bloom filters is used 
to determine which in-flash hash tables may contain the 
desired key, and only those tables are retrieved from 
flash. At a high level, the efficiency of this organiza- 
tion comes from batch I/O and sequential writes during 
insertions. Successful lookup operations may still need
random page reads; however, random page reads are almost
as efficient as sequential page reads in flash.


5.1 A Super Table 

BufferHash consists of multiple super tables. Each super 
table has three main components: a buffer, an incarnation 
table, and a set of Bloom filters. These components are 


¹For clarity, we note that BufferHash is a data structure,
while a CLAM is BufferHash applied atop DRAM and flash.

Figure 1: A Super Table


organized in two levels of hierarchy, as shown in Fig- 
ure 1. Components in the higher level are maintained in 
DRAM, while those in the lower level are maintained in 
flash. 


Buffer. This is an in-memory hash table where all newly 
inserted hash values are stored. The hash table can be 
built using existing fast algorithms such as multiple- 
choice hashing [18, 31]. A buffer can hold a fixed max- 
imum number of items, determined by its size and the 
desired upper bound of hash collisions. When the num- 
ber of items in the buffer reaches its capacity, the en- 
tire buffer is flushed to flash, after which the buffer is 
re-initialized for inserting new keys. The buffers flushed 
to flash are called incarnations. 


Incarnation table. This is an in-flash table that contains 
old and flushed incarnations of the in-memory buffer. 
The table contains k incarnations, where k denotes the
ratio of the size of the incarnation table to that of the
buffer.
The table is organized as a circular list, where a new in- 
carnation is sequentially written at the list-head. To make 
space for a new incarnation, the oldest incarnation, at the 
tail of the circular list, is evicted from the table. 

Depending on the application’s eviction policy, some
items in an evicted incarnation may need to be retained
and are re-inserted into the buffer (details in §5.1.2).


Bloom filters. Since the incarnation table contains a se- 
quence of incarnations, the value for a given hash key 
may reside in any of the incarnations depending on its in- 
sertion time. A naive lookup algorithm for an item would 
examine all incarnations, which would require reading 
all incarnations from flash. To avoid this excessive I/O 
cost, a super table maintains a set of in-memory Bloom 
filters [19], one per incarnation. The Bloom filter for 
an incarnation is a compact signature built on the hash 
keys in that incarnation. To search for a particular hash 
key, we first test the Bloom filters for all incarnations; if 
any Bloom filter matches, then the corresponding
incarnation is retrieved from flash and searched for the
desired key. Bloom filter-based lookups may result in
false positives; thus, a match could be indicated even
though there is none, resulting in unnecessary flash I/O.
As the filter size increases, the false positive rate
drops, resulting in lower I/O overhead. However, since
the available DRAM is limited, filters cannot be too
large. We examine the tradeoff in §6.4.

The Bloom filters are maintained as follows: When a 
buffer is initialized after a flush, a Bloom filter is created 
for it. When items are inserted into the buffer, the Bloom 
filter is updated with the corresponding key. When the 
buffer is flushed as an incarnation, the Bloom filter is 
saved in memory as the Bloom filter for that incarnation. 
Finally, when an incarnation is evicted, its Bloom filter
is discarded from memory. 


5.1.1 Super Table Operations 


A super table supports all standard hash table operations. 

Insert. To insert a (key, value) pair, the value is in- 
serted in the hash table in the buffer. If the buffer does not 
have space to accommodate the key, the buffer is flushed 
and written as a new incarnation in the incarnation table. 
The incarnation table may need to evict an old incarna- 
tion to make space. 

Lookup. A key is first looked up in the buffer. If 
found, the corresponding value is returned. Otherwise, 
in-flash incarnations are examined in the order of their 
age until the key is found. To examine an incarnation, 
first its Bloom filter is checked to see if the incarnation 
might include the key. If the Bloom filter matches, the 
incarnation is read from flash, and checked if it really 
contains the key. Note that since each incarnation is in
fact a hash table, looking up a key in an incarnation
requires reading only the relevant part of the
incarnation (e.g., a flash page).

Update/Delete. As mentioned earlier, flash does not
support small updates/deletions efficiently; hence, we
support them in a lazy manner. Suppose a super table
contains an item (k, v), and later, the item needs to be
updated with the item (k, v’). In a traditional hash
table, the item (k, v) is immediately replaced with
(k, v’). If (k, v) is still in the buffer when (k, v’) is
inserted, we do the same. However, if (k, v) has already
been written to flash, replacing (k, v) will be
expensive. Hence, we simply insert (k, v’) without doing
anything to (k, v). Since the incarnations are examined
in order of their age during lookup, newest first, if
the same key is inserted with multiple updated values,
the latest value (in this example, v’) is returned by a
lookup. Similarly, for deleting a key k, a super table
does not delete the corresponding item unless it is still
in the buffer; rather, the deleted key is kept in a
separate list (or a small in-memory hash table), which is
consulted before lookup: if the key is in the delete
list, it is assumed to be deleted even though it is
present in some incarnation. Lazy update wastes space on
flash, as outdated items are left on flash; the space is
reclaimed during incarnation eviction.
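
The insert, lookup, and lazy update/delete operations described above
can be sketched as follows. This is a minimal illustration, not the
authors' implementation: the class and helper names are hypothetical,
flash incarnations are simulated by in-memory dictionaries, the Bloom
filter positions are derived from SHA-256 (an assumption), and only the
full discard (FIFO) eviction is modeled.

```python
import hashlib
from collections import deque


def bloom_mask(key: bytes, m: int, h: int) -> int:
    """Bitmask with the h Bloom filter positions of `key` set (m-bit filter)."""
    mask = 0
    for i in range(h):
        digest = hashlib.sha256(bytes([i]) + key).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return mask


class SuperTable:
    def __init__(self, buffer_capacity: int, k: int, m: int = 1024, h: int = 4):
        self.capacity = buffer_capacity   # max items in the in-memory buffer
        self.k = k                        # max incarnations kept "on flash"
        self.m, self.h = m, h
        self.buffer = {}                  # in-memory hash table
        self.buffer_filter = 0            # Bloom filter of the current buffer
        self.incarnations = deque()       # newest first: (bloom_filter, table)
        self.deleted = set()              # lazy delete list
        self.flash_reads = 0              # simulated flash page reads

    def insert(self, key: bytes, value) -> None:
        self.deleted.discard(key)
        if key not in self.buffer and len(self.buffer) >= self.capacity:
            self._flush()
        self.buffer[key] = value          # an in-buffer update happens in place
        self.buffer_filter |= bloom_mask(key, self.m, self.h)

    def _flush(self) -> None:
        if len(self.incarnations) >= self.k:
            self.incarnations.pop()       # full discard of the oldest incarnation
        self.incarnations.appendleft((self.buffer_filter, self.buffer))
        self.buffer, self.buffer_filter = {}, 0

    def lookup(self, key: bytes):
        if key in self.deleted:
            return None
        if key in self.buffer:
            return self.buffer[key]
        mask = bloom_mask(key, self.m, self.h)
        for bloom, table in self.incarnations:   # newest to oldest
            if bloom & mask == mask:             # Bloom filter match
                self.flash_reads += 1            # would read one flash page
                if key in table:
                    return table[key]
        return None

    def delete(self, key: bytes) -> None:
        self.buffer.pop(key, None)
        self.deleted.add(key)   # hides older copies still in incarnations
```

Because incarnations are scanned newest first, a re-inserted key shadows
its older copies without any in-place write. In this sketch the delete
list is never pruned; a fuller version would drop entries once the
incarnations holding the deleted key have been evicted.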


5.1.2 Incarnation Eviction


In a streaming application, BufferHash may have to evict 
old in-flash items to make space for new items. The
decision of what to evict depends on application policy.
For I/O efficiency, BufferHash evicts items at the
granularity of an incarnation. Since each incarnation is
an independent hash table, discarding a part of it may require
expensive reorganization of the table and expensive I/O 
to write it back to flash. To this end, BufferHash provides 
two basic eviction primitives. The full discard primitive 
entirely evicts the oldest incarnation. The partial discard 
primitive also evicts the oldest incarnation, but it scans 
through all the items in the incarnation before eviction, 
selects some items to be retained (based on a specified 
policy), and re-inserts them into the buffer. Given these 
two basic primitives, applications can configure Buffer- 
Hash to implement different eviction policies as follows. 


FIFO. The full discard primitive naturally implements 
the FIFO policy. Since items with similar ages (i.e., items
that are flushed together from the buffer) are clustered in 
the same incarnation, discarding the oldest incarnation 
evicts the oldest items. Commercial WAN optimizers 
work in this fashion [8, 2]. 


LRU. An LRU policy can be implemented via the full 
discard mechanism with one additional mechanism: on 
every use of an item not present in the buffer, the item 
is re-inserted. Intuitively, a recently used item will be 
present in a recent incarnation, and hence it will still 
be present after discarding the oldest incarnation. This 
implementation incurs additional space overhead as the 
same item can be present in multiple incarnations. 


Update-based eviction. For a workload with many
deletes and updates, BufferHash uses the partial discard 
mechanism to discard items that have been deleted or up- 
dated. The former can be determined by examining the 
in-memory delete list, while the latter can be determined 
by checking the in-memory Bloom filters. 


Priority-based eviction. In a priority-based policy, an 
item is discarded if its priority is less than a threshold 
(the threshold can change over time, as in [35]). It can 
be implemented with the partial discard primitive, where 
an item in the discarded incarnation is re-inserted if its 
current priority is above a threshold. 

The FIFO policy is the most efficient, and the default 
policy in BufferHash. The other policies incur additional 
space and latency overhead due to more frequent buffer 
flushes and re-insertion. 
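
The partial discard primitive underlying the update-based and
priority-based policies can be sketched as a single function. The
function name, the dictionary representation of an incarnation, and the
`keep` callback are assumptions made for illustration; a real
implementation would also have to handle the buffer filling up while
items are re-inserted.

```python
from typing import Any, Callable, Dict, Hashable


def partial_discard(
    oldest: Dict[Hashable, Any],
    buffer: Dict[Hashable, Any],
    keep: Callable[[Hashable, Any], bool],
) -> int:
    """Evict `oldest`, re-inserting into `buffer` the items that `keep` accepts.

    Returns the number of items retained.
    """
    retained = 0
    for key, value in oldest.items():
        # A newer copy already in the buffer supersedes the evicted one.
        if key not in buffer and keep(key, value):
            buffer[key] = value
            retained += 1
    oldest.clear()   # the whole incarnation is dropped from flash
    return retained
```

For example, an update-based policy could pass
`keep=lambda k, v: k not in delete_list`, and a priority-based policy
could pass `keep=lambda k, v: priority[k] > threshold`, where
`delete_list`, `priority`, and `threshold` are application state.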

Note that BufferHash may not be able to strictly follow
an eviction policy other than FIFO if not enough slow
storage is available. We call an item live if it is
supposed to be present in the hash table under a given
eviction policy (e.g., for the update-based eviction
policy, the item has not been updated or deleted), and
dead otherwise. BufferHash is supposed to evict only the
dead items, and it does so if the flash has enough space
to hold all live and unevicted dead items. On the other
hand, if available flash space is limited and there are
not enough dead items to evict in order to make room for
newer items, BufferHash is forced to evict live items in
FIFO order.* We note that this sort of behavior is
unavoidable in any storage scheme dealing with too many
items to fit in a limited amount of storage.


5.1.3 Bit-slicing with a Sliding Window 


To support efficient Bloom filter lookup, we organize the
Bloom filters for all incarnations within a super table
in bit-sliced fashion [26]. Suppose a super table
contains k incarnations, and the Bloom filter for each
incarnation has m bits. We store all k Bloom filters as
m k-bit slices, where the i’th slice is constructed by
concatenating bit i from each of the k Bloom filters.
Then, if a Bloom filter uses h hash functions, we apply
them on the key x to get h bit positions in a Bloom
filter, retrieve the h bit slices at those positions, and
compute the bit-wise AND of those slices. The positions
of 1-bits in this aggregated slice, which can be looked
up from a pre-computed table, represent the incarnations
that may contain the key x.

As new incarnations are added and old ones evicted, bit
slices need to be updated accordingly. A naive approach
would reset the left-most bits of all m bit-slices on
every eviction, further increasing the cost of an
eviction operation. To avoid this, we append w extra
bits to every bit-slice, where w is the size of a word
that can be reset to 0 with one memory operation. Within
each (k+w)-bit slice, a window of k bits represents the
Bloom filter bits of the k current incarnations, and only
these bits are used during lookup. After an incarnation
is evicted, the window is shifted one bit to the right.
Since the bit falling off the window is no longer used
for lookup, it can be left unchanged. When the window
has shifted w bits, entire w-bit words are reset to zero
at once, resulting in a small amortized cost. The window
wraps around after it reaches the end of a bit-slice.
For lack of space we omit the details, which can be found
in [12].
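
The bit-sliced organization can be illustrated with the following
sketch, which stores k Bloom filters as m integers of k bits each and
answers a lookup by AND-ing h slices. The class and method names are
hypothetical and the hash positions come from SHA-256 (an assumption).
The sketch implements only the naive per-eviction reset; the sliding
window optimization is not modeled.

```python
import hashlib


class BitSlicedFilters:
    """k Bloom filters of m bits each, stored as m slices of k bits."""

    def __init__(self, k: int, m: int, h: int):
        self.k, self.m, self.h = k, m, h
        self.slices = [0] * m   # slices[i] holds bit i of every filter

    def _positions(self, key: bytes):
        for j in range(self.h):
            digest = hashlib.sha256(bytes([j]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, incarnation: int, key: bytes) -> None:
        """Record `key` in the Bloom filter of the given incarnation (0..k-1)."""
        for pos in self._positions(key):
            self.slices[pos] |= 1 << incarnation

    def candidates(self, key: bytes) -> list:
        """Incarnations whose Bloom filter matches `key`."""
        combined = (1 << self.k) - 1
        for pos in self._positions(key):
            combined &= self.slices[pos]   # bit-wise AND of the h slices
        return [i for i in range(self.k) if (combined >> i) & 1]

    def clear(self, incarnation: int) -> None:
        """Naive eviction: reset one bit in every slice (m memory operations)."""
        mask = ~(1 << incarnation)
        for i in range(self.m):
            self.slices[i] &= mask
```

A lookup touches h slices regardless of k, whereas testing k separate
filters would touch h bits in each of the k filters. The `clear` method
shows the cost that the sliding window is designed to amortize.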


5.2 Partitioned Super Tables 


Maintaining a single super table is not scalable because
the buffer and individual incarnations become very large
when a large amount of DRAM is available. As the entire
buffer is flushed at once, the flushing operation can
take a long time. Since flash I/Os are blocking
operations, lookup operations that go to flash during
this long flushing period will block (insertions can
still happen, as they go to the in-memory buffer).
Moreover, an entire incarnation from the incarnation
table is evicted at a time, increasing the eviction cost
with partial discard.


*With the update-based eviction policy, a live item can also be
evicted if the in-memory Bloom filter incorrectly concludes that the
item has been updated. However, the probability is small (equal to the
false positive rate of the Bloom filters).

Figure 2: A BufferHash with multiple super tables


BufferHash avoids this problem by partitioning the hash
key space and maintaining one super table for each
partition (Figure 2). Suppose each hash key has
$k = k_1 + k_2$ bits; then, BufferHash maintains
$2^{k_1}$ super tables. The first $k_1$ bits of a key
represent the index of the super table containing the
key, while the last $k_2$ bits are used as the key within
the particular super table.
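
The key split described above amounts to a shift and a mask. The
following sketch uses a hypothetical function name and treats the hash
key as a non-negative integer.

```python
def split_key(hash_key: int, k1: int, k2: int) -> tuple:
    """Split a (k1 + k2)-bit hash key into (super table index, in-table key)."""
    assert 0 <= hash_key < 1 << (k1 + k2)
    table_index = hash_key >> k2              # first k1 bits
    table_key = hash_key & ((1 << k2) - 1)    # last k2 bits
    return table_index, table_key
```

For instance, with k1 = 3 and k2 = 4, the 7-bit key 1010011 maps to
super table 5 (binary 101) with in-table key 3 (binary 0011).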

Partitioning enables using small buffers in super tables,
thus avoiding the problems caused by a large buffer.
However, we show in §6.4 that too many partitions (i.e.,
very small buffers) can also adversely affect
performance. We show how to choose the number of
partitions for good performance. For example, we show for
flash chips that the number of partitions should be such
that the size of a buffer matches the flash block size.

BufferHash with multiple super tables can be imple- 
mented on a flash chip by statically partitioning it and 
allocating each partition to a super table. A super table 
writes its incarnations in its partition in a circular way — 
after the last block of the partition is written, the first 
block of the partition is erased and the corresponding in- 
carnations are evicted. However, this approach may not 
be optimal for an SSD, where a Flash Translation Layer 
(FTL) hides the underlying flash chips. Even though 
writes within a single partition are sequential, writes 
from different super tables to different partitions may be 
interleaved, resulting in a performance worse than a sin- 
gle sequential write (see [17] for empirical results). To 
deal with that, BufferHash uses the entire SSD as a single 
circular list and writes incarnations from different super 
tables sequentially, in the order they are flushed to the 
flash. (This is in contrast to the log rotation approach 
of Hyperion [23] that provides FIFO semantics for each 
partition, instead of the entire key space.) Note that par- 
titioning also naturally supports using multiple SSDs in 
parallel, by distributing partitions to different SSDs. This 
scheme, however, spreads the incarnations of a super ta- 
ble all over the SSD. To locate incarnations for a given 
super table, we maintain their flash addresses along with 
their Bloom filters and use the addresses during lookup. 


6 Analysis of Costs 

In this section, we first analyze the I/O costs of insertion 
and lookup operations in CLAMs built using BufferHash 
for flash-based storage, and then use the analytical results 
to determine optimal values of two important parameters 
of BufferHash. We use the notations in Table 1. 


Symbol   Meaning
N        Total number of items inserted
M        Total memory size
B        Total size of buffers
b        Total size of Bloom filters
k        Number of incarnations in a super table
F        Total flash size
s        Average size taken by a hash entry
h        Number of hash functions
B'       Size of a single buffer (= B/n_t)
S_p      Size of a flash page/sector
S_b      Size of a flash block


Table 1: Notations used in cost analysis.


6.1 Insertion Cost 


We now analyze the amortized and the worst case cost of
an insertion operation. We assume that BufferHash is
maintained on a flash chip; later we show how the results
can be trivially extended to SSDs. Based on empirical
results [12], we use linear cost functions for flash
I/Os: reading, writing, and erasing x bits, at
appropriate granularities, cost $a_r + b_r x$,
$a_w + b_w x$, and $a_e + b_e x$, respectively.

Consider a workload of inserting N keys. Most insertions
are consumed in buffers, and hence do not need any I/O.
However, expensive flash I/O occurs when a buffer fills
and is flushed to flash. Each flush operation involves
three different types of I/O costs. First, each flush
requires writing $n_i = \lceil B'/S_p \rceil$ pages,
where B' is the size of a buffer in a super table, and
$S_p$ is the size of a flash page (or an SSD sector).
This results in a write cost of


$C_1 = a_w + b_w n_i S_p$


Second, each flush operation requires evicting an old
incarnation from the incarnation table. For simplicity,
we consider the full discard policy for an evicted
incarnation. Note that each incarnation occupies
$n_i = \lceil B'/S_p \rceil$ flash pages, and each flash
block has $n_b = S_b/S_p$ pages, where $S_b$ is the size
of a flash block. If $n_i \ge n_b$, every flush will
require erasing flash blocks; otherwise, only an
$n_i/n_b$ fraction of the flushes will require erasing
blocks. Finally, during each erase, we need to erase
$\lceil n_i/n_b \rceil$ flash blocks. Putting it all
together, we get the erase cost of a single flush
operation as


$C_2 = \min(1, n_i/n_b) (a_e + b_e \lceil n_i/n_b \rceil S_b)$


Finally, a flash block to be erased may contain valid
pages (from other incarnations), which must be backed up
before the erase and copied back after it. This can
happen because flash can be erased only at the
granularity of a block and an incarnation to be evicted
may occupy only part of a block. In this case,
$p' = (n_b - n_i) \bmod n_b$ pages must be read and
written during each flush. This results in a copying
cost of


$C_3 = a_r + p' b_r S_p + a_w + p' b_w S_p$


Amortized cost. Consider insertion of N keys. If each
hash entry occupies a space of s, each buffer can hold
B'/s entries, and hence buffers will be flushed to flash
a total of $n_f = Ns/B'$ times. Thus, the amortized
insertion cost is


$C_{amortized} = n_f (C_1 + C_2 + C_3)/N = (C_1 + C_2 + C_3) s/B'$


Note that the cost is independent of N and inversely
proportional to the buffer size B'.

Worst case cost. An insert operation experiences the 
worst-case performance when the buffer for the key is 
full, and hence must be flushed. Thus, the worst case 
cost of an insert operation is 


$C_{worst} = C_1 + C_2 + C_3$


SSD. The above analysis extends to SSDs. Since the costs
$C_2$ and $C_3$ in an SSD are handled by its FTL, the
overheads of erasing blocks and copying valid pages are
reflected in its write cost parameters $a_w$ and $b_w$.
Hence, for an SSD, we can ignore the costs $C_2$ and
$C_3$. Thus, we get $C_{amortized} = C_1 s/B'$ and
$C_{worst} = C_1$.
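
The insertion cost model of this section can be collected into one
function. This is an illustrative sketch: the function and parameter
names are hypothetical, all sizes are assumed to be in the same unit,
and the cost parameters must be supplied by the caller (none are taken
from the paper).

```python
import math


def insertion_costs(B1, s, S_p, S_b, a_r, b_r, a_w, b_w, a_e, b_e, ssd=False):
    """Return (amortized, worst_case) insertion cost for one super table.

    B1 is the buffer size B', s the entry size, S_p and S_b the page and
    block sizes; a_*/b_* are the linear cost parameters for reads (r),
    writes (w) and erases (e).
    """
    n_i = math.ceil(B1 / S_p)            # pages per incarnation
    n_b = S_b // S_p                     # pages per flash block
    c1 = a_w + b_w * n_i * S_p           # write the flushed buffer
    if ssd:
        c2 = c3 = 0.0                    # absorbed into the FTL's write cost
    else:
        c2 = min(1, n_i / n_b) * (a_e + b_e * math.ceil(n_i / n_b) * S_b)
        p = (n_b - n_i) % n_b            # valid pages copied around an erase
        c3 = a_r + p * b_r * S_p + a_w + p * b_w * S_p
    worst = c1 + c2 + c3
    amortized = worst * s / B1           # one flush per B'/s insertions
    return amortized, worst
```

When the buffer size equals the flash block size, p' is zero and every
flush erases exactly one block, so the copying cost reduces to the fixed
terms of C_3.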


6.2 Lookup Cost 


A lookup operation in a super table involves first
checking the buffer for the key, then checking the Bloom
filters to determine which incarnations may contain the
key, and finally reading a flash page for each of those
incarnations to actually look up the key. Since a Bloom
filter may produce false positives, some of these
incarnations may not contain the key, and hence some of
the I/Os may be redundant.

Suppose BufferHash contains $n_t$ super tables. Then,
each super table will have $B' = B/n_t$ bits for its
buffer, and $b' = b/n_t$ bits for Bloom filters. In
steady state, each super table will contain
$k = (F/n_t)/(B/n_t) = F/B$ incarnations. Each
incarnation contains $n' = B'/s$ entries, and a Bloom
filter for an incarnation will have $m' = b'/k$ bits.
For a given m' and n', the false positive rate of a Bloom
filter is minimized with $h = m' \ln 2/n'$ hash functions
[19]. Thus, the probability that a Bloom filter will
return a hit (i.e., indicating the presence of a given
key) is given by $p = (1/2)^h$. For each hit, we need to
read a flash page. Since there are k incarnations, the
expected flash I/O cost is given by


$C_{lookup} = k p c_r = k (1/2)^h c_r = (F/B) (1/2)^{b s \ln 2/F} c_r$,


where $c_r$ is the cost of reading a single flash page
from a flash chip, or a single sector from an SSD.
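
The last step of the derivation uses the fact that
$h = m' \ln 2/n' = (b'/k)(s/B') \ln 2 = b s \ln 2/F$, so the number of
super tables cancels out. A small sketch of the resulting formula
follows; the function name is hypothetical and all sizes are assumed to
be in bits.

```python
import math


def expected_lookup_cost(F, B, b, s, c_r):
    """Expected flash I/O cost of a lookup: (F/B) * (1/2)**h * c_r.

    F, B and b are the flash, buffer and Bloom filter sizes, s is the
    entry size, and c_r is the cost of one page read.
    """
    k = F / B                        # incarnations per super table
    h = b * s * math.log(2) / F      # optimal number of hash functions
    return k * 0.5 ** h * c_r
```

With 16 incarnations per super table and Bloom filters sized so that
h = 4, the hit probability per filter is 1/16, and the expected cost is
one page read per lookup.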


6.3 Discussion 

The above analysis can provide insights into the benefits
and overheads of various BufferHash components that are
not used in traditional hash tables. Consider a
traditional hash table stored on an SSD; without any
buffering, each insertion operation would require one
random sector write. Suppose sequentially writing a
buffer of size B' is $\alpha$ times more expensive than
randomly writing one sector of an SSD. $\alpha$ is
typically small even for a buffer significantly bigger
than a sector, mainly for two reasons. First, sequential
writes are significantly cheaper than random writes in
most existing SSDs. Second, writing multiple consecutive
sectors in a batch has better per-sector latency. In
fact, for many existing SSDs, the value of $\alpha$ is
less than 1 even for a buffer size of 256KB (e.g., 0.39
and 0.36 for Samsung and MTron SSDs respectively). For
Intel SSDs, the gap between sequential and random writes
is small; still, the value of $\alpha$ is less than 10
due to I/O batching.

Clearly, the worst case insertion cost into a CLAM using
BufferHash for flash is $\alpha$ times more expensive
than that of a traditional hash table without buffering:
a traditional hash table requires writing a random
sector, while BufferHash sequentially writes the entire
buffer. As discussed above, the value of $\alpha$ is
small for existing SSDs.
On the other hand, our previous analysis shows that the
amortized insertion cost of BufferHash on flash is at
least $B'/(\alpha s)$ times less than that of a
traditional hash table, even if we assume that the random
writes required by a traditional hash table are as cheap
as the sequential writes required by BufferHash on flash.
In practice, random writes are more expensive, and
therefore the amortized insertion cost when using
BufferHash on flash is even cheaper relative to that of a
traditional hash table.

Similarly, a traditional hash table on flash will need
one read operation for each lookup operation, even for
unsuccessful lookups. In contrast, the use of Bloom
filters can significantly reduce the number of flash
reads for unsuccessful lookups. More precisely, if the
Bloom filters are configured to provide a false positive
rate of p, the use of Bloom filters can reduce the cost
of an unsuccessful lookup by a factor of 1/p. Note that
the same benefit can be realized by using Bloom filters
with a traditional hash table as well. Even though
BufferHash maintains multiple Bloom filters over
different partitions and incarnations, the total size of
all Bloom filters will be the same as the size of a
single Bloom filter computed over all items. This is
because, for a given false positive rate, the size of a
Bloom filter is proportional to the number of unique
items in the filter.


6.4 Parameter Tuning 


Tuning BufferHash for good CLAM performance requires
tuning two key parameters. First, one needs to decide
how much DRAM to use and, if enough DRAM is available,
how much of it to allocate for buffers and how much for
Bloom filters. Second, once the total size of in-memory
buffers is decided, one needs to decide how many super
tables to use. We use the cost analysis above to address
these issues.

Optimal buffer size. Assume that the total memory size
is M bits, of which B bits are allocated for (all)
buffers (in all super tables) and $b = M - B$ bits are
allocated for Bloom filters. Our previous analysis shows
that the value of b does not directly affect insertion
cost; however, it affects lookup cost. So, we would like
to find the optimal value of b that minimizes the
expected lookup cost.

Intuitively, the size of a buffer poses a tradeoff be- 
tween the total number of incarnations and the proba- 
bility of an incarnation being read from flash during 
lookup. As our previous analysis showed, the I/O cost is 
proportional to the product of the number of incarnations 
and the hit rate of Bloom filters. On one hand, reducing 
buffer size increases the number of incarnations, increas- 
ing the cost. On the other hand, increasing buffer size 
leaves less memory for Bloom filters, which increases their
false positive rate and hence the I/O cost.

We can use our previous analysis to find a sweet spot.
Our analysis showed that the lookup cost is given by

$$C = \frac{F}{B} \cdot \left(\frac{1}{2}\right)^{(M-B)\, s \ln 2 / F} \cdot c_r .$$

The cost C is minimized when dC/dB = 0, or, equivalently,
d(log2(C))/dB = 0. Solving this equation gives the optimal
value of B as

$$B_{opt} = \frac{F}{s (\ln 2)^2} \approx \frac{2F}{s} .$$
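The intermediate step of this minimization can be written out as follows. It uses the same symbols as the cost expression and assumes that $c_r$ denotes the cost of a single flash read:

```latex
\begin{aligned}
\log_2 C &= \log_2 F - \log_2 B - \frac{(M-B)\, s \ln 2}{F} + \log_2 c_r, \\
\frac{d \log_2 C}{dB} &= -\frac{1}{B \ln 2} + \frac{s \ln 2}{F} = 0
\;\Longrightarrow\;
B_{opt} = \frac{F}{s (\ln 2)^2} \approx \frac{2.08\, F}{s}.
\end{aligned}
```

The second derivative, $1/(B^2 \ln 2)$, is positive, so this stationary point is a minimum, and M drops out because it appears only in a term that is linear in B.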

Interestingly, this optimal value of B does not depend 
on M; rather, it depends only on the total size F of flash
and the average space s taken by each hashed item. Thus,
given some memory of size M > B, we should use
≈ 2F/s bits for buffers, and the remaining for Bloom
filters. If additional memory is available, that should be 
used only for Bloom filters, and not for the buffers. 
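As a rough numerical check of this result, the sketch below evaluates the closed form for a 32GB flash with 32 effective bytes per entry and compares it against a brute-force minimization of the lookup cost. The function names, the whole-megabyte search grid, and the default `c_r = 1.0` are assumptions made for illustration.

```python
import math

GB = 8 * 2**30  # bits in one gigabyte
MB = 8 * 2**20  # bits in one megabyte

def lookup_cost(B, M, F, s, c_r=1.0):
    """C = (F/B) * (1/2)^((M-B) * s * ln2 / F) * c_r, with all sizes in bits."""
    return (F / B) * 0.5 ** ((M - B) * s * math.log(2) / F) * c_r

def optimal_buffer_bits(F, s):
    """Closed form B_opt = F / (s * (ln 2)^2)."""
    return F / (s * math.log(2) ** 2)

F = 32 * GB   # flash size
s = 32 * 8    # effective bits per entry (16 bytes at 50% utilization)
B_opt = optimal_buffer_bits(F, s)

def best_buffer_mb(M):
    """Brute-force minimizer over whole-megabyte buffer sizes below M."""
    return min(range(16, M // MB), key=lambda mb: lookup_cost(mb * MB, M, F, s))

print(f"closed form: {B_opt / MB:.0f} MB")  # about 266 MB
print(f"brute force, M = 4 GB: {best_buffer_mb(4 * GB)} MB")
print(f"brute force, M = 8 GB: {best_buffer_mb(8 * GB)} MB")
```

Doubling M leaves the minimizer unchanged, which illustrates why any additional memory should go to Bloom filters rather than buffers.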


Total memory size. We can also determine how much 
total memory to use for BufferHash. Intuitively, using 
more memory improves lookup performance, as this allows
using larger Bloom filters and lowering false positive
rates. Suppose we want to limit the I/O overhead due to
false positives to $C_{target}$. Then, we can determine b’,
the required size of Bloom filters, as follows:

$$C_{target} \ge \frac{F}{B_{opt}} \cdot \left(\frac{1}{2}\right)^{b' s \ln 2 / F} \cdot c_r$$

$$b' \ge \frac{F}{s (\ln 2)^2} \, \ln\!\left(\frac{s (\ln 2)^2 \, c_r}{C_{target}}\right)$$


Figure 3 shows the required size of a Bloom filter for
different expected I/O overheads. As the graph shows, the
benefit of using a large Bloom filter diminishes after a
certain size. For example, for BufferHash with 32GB flash
and 16 bytes per entry (effective size of 32 bytes per
entry for 50% utilization of hashtables), allocating 1GB for
all Bloom filters is sufficient to limit the expected I/O
overhead $C_{target}$ below 1ms.

Figure 3: Expected I/O overhead vs Bloom filter size


Hence, in order to limit I/O overhead during lookup to
$C_{target}$, BufferHash requires ($B_{opt}$ + b’) bits of memory,
of which $B_{opt}$ is for buffers and the rest for Bloom filters.
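The memory budget can be sketched numerically as below. The flash read cost `c_r = 0.15` ms is an assumed value chosen only for illustration, as are the function names; the formulas are the bound on b’ and the overhead expression evaluated at the optimal buffer size.

```python
import math

GB = 8 * 2**30  # bits in one gigabyte
MB = 8 * 2**20  # bits in one megabyte

def bloom_bits_for_target(F, s, c_r, c_target):
    """Smallest b' (in bits) that keeps the false-positive overhead at c_target."""
    k = s * math.log(2) ** 2  # equals F / B_opt
    return (F / k) * math.log(k * c_r / c_target)

def overhead_at_optimum(F, s, b, c_r):
    """I/O overhead with B = B_opt and b bits of Bloom filters."""
    k = s * math.log(2) ** 2
    return k * 0.5 ** (b * s * math.log(2) / F) * c_r

F, s = 32 * GB, 32 * 8
c_r = 0.15       # ms per flash read (assumed for this sketch)
c_target = 1.0   # ms

b_prime = bloom_bits_for_target(F, s, c_r, c_target)
B_opt = F / (s * math.log(2) ** 2)

print(f"Bloom filters: {b_prime / MB:.0f} MB, buffers: {B_opt / MB:.0f} MB, "
      f"total: {(B_opt + b_prime) / MB:.0f} MB")
```

Substituting the computed b’ back into the overhead expression returns the target, which is a quick way to confirm that the two formulas are consistent with each other.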


Number of super tables. Given a fixed memory size B 
for all buffers, the number of super tables determines the 
size B’ of a buffer within a super table. As our anal- 
ysis shows, B’ does not affect the lookup cost; rather, 
it affects the amortized and worst case cost of insertion. 
Thus, B’ should be set to minimize insertion cost. 


Figure 4 shows the insertion cost of using BufferHash,
based on our previous analysis, on two flash-based
media. (The SSD performs better because it uses multiple
flash chips in parallel.) For the flash chip, both
amortized and worst-case costs are minimized when the
buffer size B’ matches the flash block size. The situation is
slightly different for SSDs; as Figure 4(b) and (c) show, a
large buffer reduces average latency but increases
worst-case latency. An application should use its tolerance for
average- and worst-case latencies and our analytical
results to determine the desired size of B’ and the number
of super tables B/B’.


7 Implementation and Evaluation 


In this section, we measure and dissect the performance 
of various hash table operations in our CLAM design un- 
der a variety of workloads. Our goal is to answer the 
following key questions: 


(i) What is the baseline performance of lookups and
inserts in our design? How does the performance compare
against existing disk-based indexes (e.g., the popular
Berkeley-DB)? Are there specific workloads that our
approach is best suited for?

(ii) To what extent do different optimizations in our
design (namely, buffering of writes, use of Bloom filters,
and use of bit-slicing) contribute towards our CLAM’s
overall performance?

(iii) To what extent does the use of flash-based secondary
storage contribute to the performance?

(iv) How well does our design support a variety of hash
table eviction policies?

We start by describing our implementation and how 
we configure it for our experiments.

Figure 4: Amortized and worst-case insertion cost on a flash chip and an Intel SSD. Only flash I/O costs are shown.


7.1 Implementation and Configuration 


We have implemented BufferHash in ~3000 lines of 
C++ code. The hash table in a buffer is implemented 
using Cuckoo hashing [25] with two hash functions. 
Cuckoo hashing utilizes space efficiently and avoids the 
need for hash chaining in the case of collisions. 
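For readers unfamiliar with the technique, the following is a minimal sketch of cuckoo hashing with two hash functions. It is not the C++ implementation described here: the class name, the SHA-1-derived slot functions, the `max_kicks` bound, and the small `stash` used when placement fails are all assumptions made for illustration.

```python
import hashlib

class CuckooTable:
    """Toy cuckoo hash table: two hash functions, one entry per slot."""

    def __init__(self, capacity=8192, max_kicks=64):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [None] * capacity  # each slot holds (key, value) or None
        self.stash = {}                 # overflow for entries that could not be placed

    def _slot(self, key: bytes, which: int) -> int:
        digest = hashlib.sha1(bytes([which]) + key).digest()
        return int.from_bytes(digest[:8], "big") % self.capacity

    def get(self, key: bytes):
        # A key can only live in one of its two slots (or the stash).
        for which in (0, 1):
            entry = self.slots[self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return self.stash.get(key)

    def put(self, key: bytes, value) -> bool:
        """Insert or update; returns False when a rebuild would be advisable."""
        for which in (0, 1):
            i = self._slot(key, which)
            if self.slots[i] is not None and self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return True
        if key in self.stash:
            self.stash[key] = value
            return True
        entry, i = (key, value), self._slot(key, 0)
        for _ in range(self.max_kicks):
            if self.slots[i] is None:
                self.slots[i] = entry
                return True
            self.slots[i], entry = entry, self.slots[i]  # evict the occupant
            first, second = self._slot(entry[0], 0), self._slot(entry[0], 1)
            i = second if i == first else first          # send it to its other slot
        self.stash[entry[0]] = entry[1]  # give up; keep the entry reachable
        return False
```

Lookups probe at most two slots, so no chains are needed; the price is that insertions may displace existing entries, and a long displacement sequence signals that the table should be rebuilt.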

To simplify implementation, each partition is main- 
tained in a separate file with all its incarnations. A new 
incarnation is written by overwriting the portion of file 
corresponding to the oldest incarnation in its super table. 
Thus, the performance numbers we report include small 
overheads imposed by the ext3 file system we use. One 
can achieve better performance by writing directly to the 
disk as a raw device, bypassing the file system. 
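The circular overwrite of the oldest incarnation can be sketched as a simple offset computation. The function name and the tiny sizes below are hypothetical, and an in-memory `io.BytesIO` stands in for the per-partition file.

```python
import io

def incarnation_offset(flush_count: int, incarnations_per_table: int,
                       buffer_bytes: int) -> int:
    """Byte offset in the partition file where the next incarnation is written."""
    return (flush_count % incarnations_per_table) * buffer_bytes

# Toy example: 4 incarnation slots of 8 bytes each, 6 flushes.
partition_file = io.BytesIO()
for n in range(6):
    partition_file.seek(incarnation_offset(n, 4, 8))
    partition_file.write(bytes([n]) * 8)  # flush n overwrites the oldest slot

data = partition_file.getvalue()
print([data[i] for i in range(0, len(data), 8)])  # flush number stored in each slot
```

Once all slots are in use, each new flush lands on the slot holding the oldest incarnation, so the file never grows beyond its initial footprint.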

We run the BufferHash implementation atop two dif- 
ferent SSDs: an Intel SSD (model: X18-M, which rep- 
resents a new generation SSD), and a Transcend SSD 
(model: TS32GSSD25, which represents a relatively old 
generation but cheaper SSD). 


7.1.1 Configuring the CLAM 


As mentioned in §3, our key motivating applications like 
WAN optimization and deduplication employ hash ta- 
bles of size 16-32GB. To match this, we configure our 
CLAMs with 32GB of slow storage and 4GB of DRAM. 
The size of a buffer in a super table is set to 128KB, as 
suggested by our analysis in §6.4. We limit the utilization
of the hash table in a buffer to 50%, as a higher utilization
increases hash collisions and the possibility of rebuilding
the hash table for cuckoo hashing. Also, each hash entry
takes 16 bytes of space. Thus, each buffer (and each
incarnation) contains 4096 hash entries. 

According to the analysis in §6.4, the optimal size of
buffers for the above configuration is 266MB. We now
experimentally validate this. Figure 5 shows the variation
of false positive rates as the memory allocated to
buffers is varied from 128KB to 3072MB in our prototype.
The overall trend is similar to that shown by our
analysis in §6.4, with the optimal spurious rate of 0.0001
occurring at a 256MB net size of buffers. The small
difference from our analytically-derived optimal of 266MB
arises because our analysis does not restrict the optimal
number of hash functions to be an integer.

Note that the spurious rate is low even at 2GB (0.01).
We select this configuration (2GB for buffers and 32GB
for slow storage) as the candidate configuration for the
rest of our experiments. This gives us 16 incarnations per
buffer and a total of 16,384 buffers in memory.

Table 2: A deeper look at lookup latencies.
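The figures quoted in this subsection follow from simple arithmetic on the configuration. The variable names in this sketch are illustrative; the input values are the ones stated above.

```python
KB, MB, GB = 2**10, 2**20, 2**30

buffer_bytes = 128 * KB    # size of one buffer in a super table
entry_bytes = 16           # size of one hash entry
utilization = 0.5          # cap on hash table occupancy
buffer_memory = 2 * GB     # DRAM devoted to buffers
flash_bytes = 32 * GB      # slow storage

entries_per_buffer = int(buffer_bytes // entry_bytes * utilization)
num_buffers = buffer_memory // buffer_bytes
incarnations_per_buffer = flash_bytes // buffer_memory

print(entries_per_buffer, num_buffers, incarnations_per_buffer)  # 4096 16384 16
```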


7.2 Lookups and Inserts 


We start by considering the performance of basic hash 
table operations, namely lookups and inserts. We study 
other operations such as updates in §7.4. 

We use synthetic workloads to understand the perfor- 
mance. Each synthetic workload consists of a sequence 
of lookups and insertions of keys. For simplicity, we 
focus on a single workload for the most part. In this 
workload, every key is first looked up, and then inserted. 
The keys are generated using a random distribution with
varying range; the range affects the lookup success rate
(or “LSR”) of a key. These workloads are motivated by
the WAN optimization application discussed in §8. We
also consider other workloads with different ratios of
insert and lookup operations in §7.2.3. Further, in order to
stress-test our CLAM design, we assume that keys arrive 
in a continuous backlogged fashion in each workload. 
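A workload of this shape can be sketched as follows. This is an assumed stand-in rather than the generator used in the experiments: a Python `set` plays the role of the hash table, keys are drawn uniformly, and `run_workload` is a hypothetical name.

```python
import random

def run_workload(num_ops: int, key_range: int, seed: int = 0) -> float:
    """Look up each key, then insert it; return the observed lookup success rate."""
    rng = random.Random(seed)
    table = set()  # stand-in for the hash table under test
    hits = 0
    for _ in range(num_ops):
        key = rng.randrange(key_range)
        if key in table:   # lookup first ...
            hits += 1
        table.add(key)     # ... then insert
    return hits / num_ops

# A narrow key range yields many repeated keys and hence a high LSR.
print(f"narrow range: LSR = {run_workload(20000, 1000):.2f}")
print(f"wide range:   LSR = {run_workload(20000, 1_000_000):.2f}")
```

Widening the key range lowers the chance that a key has been seen before, which is how the range controls the LSR.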


7.2.1 Latencies 

In Figure 6(a), we show the distribution of latencies for
lookup operations on our CLAM with both an Intel and
a Transcend SSD (the curves labeled BH + SSD). This
workload has an LSR of approximately 40%. Around
62% of the time, the lookups take little time (< 0.02ms)
for both Intel and Transcend SSD, as they are served by
either the in-memory Bloom filters or in-memory buffers.
99.8% of the lookup times are less than 0.176ms for the
Intel SSD. For Transcend SSD, 90% of the lookup times
are under 0.6ms, and the maximum is 1ms. The Intel
SSD offers significantly better performance than the
Transcend SSD.

Figure 6: CLAM latencies on different media.

To understand the lookup latencies better, we exam- 
ine the flash I/Os required by a lookup operation in our 
CLAM prototypes, under two different lookup success 
ratios in Table 2. Most lookups go to slow storage only
when required, i.e., when the key is on slow storage (which
happens in 0% of cases for LSR = 0 and in slightly under
40% of cases for LSR = 0.4); spurious flash I/O may
be needed in the rare case of false positives (recall that
BufferHash is configured for a 0.01 false positive rate).
Nevertheless, more than 99% of lookups require at most
one flash read.

In Figure 6(b), we show the latencies for insertions on 
different CLAM prototypes. Since BufferHash buffers 
writes in memory before writing to flash, most insert
operations are done in memory. Thus, the average insert cost
is very small (0.006 ms and 0.007 ms for Intel and Transcend
SSDs, respectively). Since a buffer holds around
4096 items, only 1 out of 4096 insertions on average
requires writing to flash. The worst case latency, when a
buffer is flushed to the SSD and requires erasing a block, 
is 2.72 ms and 30 ms for Intel and Transcend SSDs, re- 
spectively. 

On the whole, we note that our CLAM design achieves 
good lookup and insert performance by reducing unsuc- 
cessful lookups in slow storage and by batching multiple 
insertions into one big write. 
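The two mechanisms named above, Bloom-filtered lookups and batched writes, can be illustrated with a toy model. This is a simplified sketch and not the BufferHash implementation: it uses a single buffer rather than partitioned super tables, omits bit-slicing, keeps "flash" incarnations as in-memory dictionaries, and all class names and default sizes are assumptions.

```python
import hashlib
from collections import deque

class Bloom:
    """Small Bloom filter whose hash positions are carved out of a SHA-1 digest."""
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes):
        digest = hashlib.sha1(key).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class ToyBufferHash:
    def __init__(self, buffer_entries=4096, max_incarnations=16):
        self.buffer = {}
        self.buffer_entries = buffer_entries
        # Newest incarnation first; a full deque drops the oldest (FIFO eviction).
        self.incarnations = deque(maxlen=max_incarnations)
        self.flash_reads = 0
        self.flash_writes = 0

    def insert(self, key: bytes, value):
        self.buffer[key] = value  # served from memory
        if len(self.buffer) >= self.buffer_entries:
            self._flush()

    def _flush(self):
        bloom = Bloom()
        for key in self.buffer:
            bloom.add(key)
        self.incarnations.appendleft((bloom, self.buffer))
        self.flash_writes += 1  # one large sequential write per full buffer
        self.buffer = {}

    def lookup(self, key: bytes):
        if key in self.buffer:
            return self.buffer[key]
        for bloom, data in self.incarnations:
            if key in bloom:           # only then pay for a flash read
                self.flash_reads += 1
                if key in data:
                    return data[key]
        return None
```

In this model an unsuccessful lookup touches flash only on a Bloom filter false positive, and flash is written once per full buffer rather than once per insert.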


7.2.2 Comparison with DB-Indexes 
We now compare our CLAM prototypes against the hash 
table structure in Berkeley-DB (BDB) [6], a popular 
database index. We use the same workload as above. 
(We also considered the B-Tree index of BDB, but the 
performance was worse than the hash table. Results are 
omitted for brevity.) We consider the following system 
configurations: (1) DB+SSD: BDB running on an SSD, 
with BDB recommended configurations for SSDs, and 
(2) DB+Disk: BDB running on a magnetic disk. 

Figures 7(a) and (b) show the lookup and the insert 
latencies for the two systems. The average lookup and 
insert latencies for DB+Disk are 6.8 ms and 7 ms,
respectively. More than 60% of the lookups and more than
40% of the inserts have latencies greater than 5 ms, cor- 
responding to high seek cost on disks. Surprisingly, for 
the Intel SSD, the average lookup and insert latencies are 
also high: 4.6 ms and 4.8 ms, respectively. Around 40%
of lookups and 40% of inserts have latencies greater than 5
ms! This is counterintuitive given that the Intel SSD has
significantly faster random I/O latency (0.15ms) than
magnetic disks. This is explained by the fact that the low
latency of an SSD is achieved only when the write load
on the SSD is “low”; i.e., there are sufficient pauses
between bursts of writes so that the SSD has enough time
to clean dirty blocks to produce erased blocks for new 
writes [17]. Under a high write rate, the SSD quickly 
uses up its pool of erased blocks and then I/Os block un- 
til the SSD has reclaimed enough space from dirty blocks 
via garbage collection. 

This result shows that existing disk based solutions 
that send all I/O requests to disks are not likely to per- 
form well on SSDs, even if SSDs are significantly faster 
than disks (i.e., for workloads that give SSDs sufficient 
time for garbage collection). In other words, these solu- 
tions are not likely to exploit the performance benefit of 
SSDs under “high” write load. In contrast, since Buffer- 
Hash writes to flash only when a buffer fills up, it poses 
a relatively “light” load on SSD, resulting in faster reads. 

We do note that it is possible to supplement the 
BDB index with an in-memory Bloom filter to improve 
lookups. We anticipate that, on disks, a BDB with in- 
memory Bloom filter will have similar lookup latencies 
as a BufferHash. However, on SSDs, a BufferHash is 
likely to have a better lookup performance — because of 
the lack of buffering, insertions in BDB will incur a large 
number of small writes, which adversely affect SSDs’ 
read performance due to fragmentation and background 
garbage collection activities. 


7.2.3 Other Workloads 


We evaluate how our CLAM design performs with work- 
loads having different relative fractions of inserts and 
lookups. Our goal is to understand the workloads where 
the benefits of our design are the most significant. Ta- 
ble 3 shows the variation of the latency per opera- 
tion with different lookup fractions in the workload for 
BH+SSD and DB+SSD on Transcend-SSD. 

As Table 3 shows, the latency per operation for BDB 
decreases with increasing fraction of lookups. This is 
due to two reasons. First, (random) reads are signifi- 
cantly cheaper than random writes in SSDs. Since the 
increasing lookup fraction increases the overall fraction 
of flash reads, it reduces the overall latency per operation. 
Second, even the latency of individual lookups decreases
with increasing fraction of lookups (not shown in the
table). This is because, with a smaller fraction of flash
write I/O, SSDs involve less garbage collection overhead
that could interfere with all flash I/Os in general.

Table 3: Per-operation latencies with different lookup
fractions in workloads. LSR=0.4 for all workloads and
Transcend SSD is employed.

Figure 7: Berkeley-DB latencies for lookups and inserts.

In contrast, for a CLAM, the latency per operation
improves with decreasing fraction of lookups. This can be
attributed to buffering due to which the average insert la- 
tency for a CLAM is reduced. As Table 3 shows, our 
CLAM design is 17x faster for write-intensive work- 
loads than for read-intensive workloads. 


7.3 Dissecting Performance Benefits 

In what follows, we examine the contribution of different 
aspects of our CLAM design on the overall performance 
benefits it offers. 


7.3.1 Contribution of BufferHash optimizations 


The performance of our CLAM comes from three main 
optimizations within BufferHash: (1) buffering of in- 
serts, (2) using Bloom filters, and (3) using windowed 
bit-slicing. To understand how much each of these op- 
timizations contributes towards CLAM’s performance, 
we evaluate our Intel SSD-based CLAM without one of 
these optimizations at a time. 

The effect of buffering is obvious; without it, all inser- 
tions go to the flash, yielding an average insertion latency 
of ~4.8ms at a high insert rate, i.e., continuous key
insertions (compared to ~0.006ms with buffering). Even at
a low insert rate, the average insertion latency is ~0.3ms,
and thus buffering gives significant benefits.

Without Bloom filters, each lookup operation needs to 
check many incarnations until the key is found or all in- 
carnations have been checked. Since checking an incar- 
nation involves a flash read, this makes lookups slower. 
The worst case happens with 0% redundancy, in which 
case each lookup needs to check all 16 incarnations. Our 
experiments show that even for 40% and 80% LSR, the
average flash I/O latencies are 1.95ms and 1.5ms,
respectively, without using Bloom filters. In contrast, using
Bloom filters avoids expensive flash I/O, reducing flash
I/O costs to 0.06ms and 0.13ms for 40% and 80% LSR,
respectively, and giving a speedup of 10-30×.
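The quoted speedup range can be recovered directly from the four latencies above. The dictionary names are illustrative, and the 1.5ms figure is read from a partially garbled value in the source text.

```python
# Average flash I/O latency per lookup, in milliseconds.
without_bloom_ms = {"40% LSR": 1.95, "80% LSR": 1.5}
with_bloom_ms = {"40% LSR": 0.06, "80% LSR": 0.13}

speedup = {k: without_bloom_ms[k] / with_bloom_ms[k] for k in with_bloom_ms}
for workload, factor in speedup.items():
    print(f"{workload}: {factor:.1f}x")
```

The lower-LSR workload benefits more because a larger share of its lookups are unsuccessful, and those are the ones that Bloom filters keep away from flash.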

Bit-slicing improves lookup latencies by ~20% under 
low LSR, where the lookup workload is mostly memory 
bound. However, the benefit of using bit-slicing becomes 
negligible under a high LSR, since the lookup latency is 
then dominated by flash I/O latency. 


7.3.2 Contribution of Flash-based Storage 

The design of BufferHash is targeted specifically to- 
ward flash-based storage. In this section, we evalu- 
ate the contribution of the I/O properties of flash-based 
storage to the overall performance of our CLAM de- 
sign. To aid in this, we compare two CLAM designs: 
(1) BH+SSD: BufferHash running on an SSD and (2) 
BH+Disk: BufferHash running on a magnetic disk (Hitachi
Deskstar 7K80 drive). We use a workload with 40%
look-up success rate over random keys with interleaved 
inserts and lookups. 

Figures 6(a) and (b) also show the latencies for 
lookups and inserts in BH+Disk. Lookup latencies range 
from 0.1 to 12ms, an order of magnitude worse than the 
SSD prototypes (BH+SSD) due to the high seek laten- 
cies in disks. The average insert cost is very small and 
the worst case insert cost is 12ms, corresponding to a 
high seek latency for disk. Thus, the use of SSD con- 
tributes to the overall high performance of CLAM. 

Comparing Figures 6 and 7, we see that BH+Disk 
performs better than DB+SSD and DB+Disk on both 
lookups and inserts. This shows that while using SSDs 
is important, it is not sufficient for high performance. It 
is crucial to employ BufferHash to best leverage the I/O 
properties of the SSDs. 


7.4 Eviction Policies 

Our experiments so far are based on the default FIFO 
eviction policy of BufferHash which we implement using 
the full discard primitive (§5.1.2). As stated earlier, the 
design of BufferHash is ideally suited for this policy. We 
now consider other eviction policies. 

LRU. We implemented LRU using the full discard primitive
as noted in §5.1.2. Omitting the details of our evaluation
in the interest of brevity, we note that the performance
of lookups was largely unaffected compared
to FIFO; this is because the “re-insertion” operations
that help emulate LRU happen asynchronously without
blocking lookups (§5.1.2). In the case of inserts, the
in-memory buffers get filled faster due to re-insertions, 
causing flushes to slower storage to become more fre- 
quent. The resulting increase in average insert latency, 
however, is very small: with a 40%-LSR workload hav- 
ing equal fractions of lookups and inserts, the average in- 
sert latency increases from 0.007ms to 0.008ms on Tran- 
scend SSD. 
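The re-insertion idea can be sketched on a toy FIFO store. This is an assumed model, not the scheme of §5.1.2 in detail: incarnations are plain dictionaries, and an explicit `drain_reinsertions` call stands in for the asynchronous step.

```python
from collections import deque

class ToyLRUStore:
    def __init__(self, buffer_entries=4, max_incarnations=2):
        self.buffer = {}
        self.buffer_entries = buffer_entries
        # Newest first; appending to a full deque fully discards the oldest.
        self.incarnations = deque(maxlen=max_incarnations)
        self.pending = []  # entries awaiting re-insertion

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_entries:
            self.incarnations.appendleft(self.buffer)
            self.buffer = {}

    def lookup(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for incarnation in self.incarnations:
            if key in incarnation:
                # Queue the re-insertion instead of doing it inline,
                # so the lookup itself is not delayed.
                self.pending.append((key, incarnation[key]))
                return incarnation[key]
        return None

    def drain_reinsertions(self):
        """Stand-in for the asynchronous re-insertion step."""
        pending, self.pending = self.pending, []
        for key, value in pending:
            self.insert(key, value)
```

A recently used key is thereby copied into the newest data and survives the next full discard, while re-insertions also fill the buffer sooner, which matches the slight rise in insert latency reported above.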
Figure 8: (a) CCDF of insert latencies for the update- 
based policy. Both axes are in log-scale. (b) CDF of the 
number of incarnations tried upon a buffer flush. In both 
cases, the workload has 40% LSR and equal fractions of 
inserts and lookups. 


Partial discard. We now consider the two partial discard
policies discussed in §5.1.2: the update-based policy,
where only the stale entries are discarded, and the
priority-based policy, where entries with priority lower
than a threshold are discarded. We use a workload with a
40% update rate using keys generated by a random
distribution. Figure 8(a) shows the CCDF of insert latencies
on Transcend and Intel SSDs for the update-based policy.
We note that an overwhelming fraction of the latencies
remain unchanged, but the rest of the latencies (1%)
worsen significantly. On the whole, this causes the average
insertion cost to increase significantly to 0.56ms on
Transcend SSD and 0.08ms on an Intel SSD. Nevertheless,
this is still an order of magnitude smaller than the
average latency when employing BerkeleyDB on SSDs.
For the priority-based policy, we used priority values
equally distributed over keys. With different thresholds, we
obtained similar qualitative performance (results omitted
for brevity).


Three factors contribute to the higher latency observed 
in the tail of the distribution above: (1) When a buffer is 
flushed to slow storage, there is the additional cost of 
reading entries from the oldest incarnation and finding 
entries to be retained. (2) In the worst case, all entries 
in the evicted incarnation may have to be retained (for 
example, in the update-based policy this happens when 
none of the entries in the evicted incarnation have been 
updated or deleted). In that case, the in-memory buffer 
becomes full and is again flushed causing an eviction of 
the next oldest incarnation. These “cascaded evictions” 
continue until some entries in an evicted incarnation can 
be discarded, or all incarnations have been tried. In the 
latter case, all entries of the oldest incarnation are dis- 
carded. Cascaded evictions contribute to the high inser- 
tion cost seen in the tail of the distribution in Figure 8(a). 
(3) Since some of the entries are being retained after 
eviction, the buffer starts filling up more frequently and 
number of flushes to slow storage increases as a result. 
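The cascade described in factor (2) can be sketched with a toy update-based policy. This is a simplified model under stated assumptions: incarnations are in-memory dictionaries, an entry counts as stale when a newer copy of its key exists, and after every incarnation has been tried the one currently being evicted is dropped whole. The class and attribute names are hypothetical.

```python
class ToyPartialDiscard:
    def __init__(self, buffer_entries=4, max_incarnations=3):
        self.buffer = {}
        self.buffer_entries = buffer_entries
        self.max_incarnations = max_incarnations
        self.incarnations = []      # index 0 is the newest
        self.evictions_tried = []   # incarnations tried per triggering flush

    def _is_stale(self, key):
        # Update-based policy: stale if a newer copy of the key exists.
        return key in self.buffer or any(key in inc for inc in self.incarnations)

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_entries:
            self._flush()

    def lookup(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for incarnation in self.incarnations:
            if key in incarnation:
                return incarnation[key]
        return None

    def _flush(self):
        tried = 0
        while len(self.buffer) >= self.buffer_entries:
            self.incarnations.insert(0, self.buffer)  # write buffer as newest
            self.buffer = {}
            if len(self.incarnations) <= self.max_incarnations:
                break                                  # flash not yet full
            oldest = self.incarnations.pop()           # evict the oldest
            tried += 1
            if tried >= self.max_incarnations:
                break  # every incarnation tried: discard this one entirely
            for key, value in oldest.items():
                if not self._is_stale(key):
                    self.buffer[key] = value           # retain live entries
            # If everything was retained the buffer is full again and the
            # loop repeats: this is a cascaded eviction.
        self.evictions_tried.append(tried)
```

With some updated keys the cascade stops after one eviction; with only fresh keys every evicted entry must be retained, and the loop runs until all incarnations have been tried.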


We find that when a buffer becomes full on a key insertion
and needs to be flushed, the overall additional cost of
insertion operations on Transcend SSD is 17.4ms (on average)
for the update-based policy. Of this, the additional
cost arising from reading and checking the entries of
each incarnation is 1.62ms. Note that this is the only
additional cost incurred when an incarnation eviction does
not result in cascaded evictions. For the priority-based
policy, this cost is lower: 1.48 ms. The update-based policy
is more expensive as it needs to search Bloom filters to
see if an entry has already been updated.

We further find that only on rare occasions do cas- 
caded evictions result in multiple incarnations getting ac- 
cessed. In almost 90% of the cases where cascades hap- 
pen, no more than 3 incarnations are tried as shown in 
Figure 8(b). On average, just 1.5 incarnations are tried 
(i.e., 0.5 incarnations are additionally flushed to slow 
storage, on average). 

Thus, our approach supports FIFO and LRU eviction 
well, but it imposes a substantially higher cost for a small 
fraction of requests in other general eviction policies. 
The high cost can be controlled by loosening the se- 
mantics of the partial discard policies in order to limit 
cascaded evictions. For instance, applications using the 
priority-based policy could retain the top-k high priority 
entries rather than using a fixed threshold on priority. It 
is up to the application designer to select the right trade- 
off between the semantics of the eviction policies and the 
additional overhead incurred. 
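The top-k relaxation can be sketched in a few lines. The function name and the example priorities are hypothetical; `heapq.nlargest` is the standard library routine for selecting the k largest items by a key function.

```python
import heapq

def retain_top_k(entries: dict, priority: dict, k: int) -> dict:
    """Keep only the k highest-priority entries of an evicted incarnation."""
    keep = heapq.nlargest(k, entries, key=lambda key: priority[key])
    return {key: entries[key] for key in keep}

evicted = {"a": 1, "b": 2, "c": 3}
priority = {"a": 5, "b": 9, "c": 1}
print(retain_top_k(evicted, priority, 2))
```

Choosing k smaller than the buffer capacity bounds how many entries an eviction can push back into the buffer, so a single eviction cannot refill the buffer and trigger a cascade.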


7.5 Evaluation Summary 


The above evaluation highlights the following aspects of 
our CLAM design: 

(1) BufferHash on Intel SSD offers lookup latency of 
0.06 ms and insert latency of 0.006 ms, and gives an 
order of magnitude improvement over Berkeley DB on 
Intel SSD. 

(2) Buffering of writes significantly improves insert 
latency. Bloom filters significantly reduce unwanted 
lookups on slow storage, achieving a 10-30× improvement
over BufferHash without Bloom filters. Bit-slicing
contributes a 20% improvement when the lookup workload
is mostly memory bound.

(3) For lookups, BufferHash on SSDs is an order of mag- 
nitude better than BufferHash on disk. However, SSDs 
alone are not sufficient to give high performance. 


8 WAN Optimizer Using CLAM 


In this section, we study the benefits of using our CLAM 
prototypes in an important application scenario, namely, 
WAN optimization.

A typical WAN optimizer has three components:

(1) Connection management (CM) front-end: When 
bytes from a connection arrive at the connection manage- 
ment front-end, they are accumulated into buffers for a 
short amount of time (we use 25ms). The buffered object 
data is divided into chunks by computing content-based 
chunk boundaries using Rabin-Karp fingerprints [38, 
34]. A SHA-1 hash is computed for each chunk thus 
identified. 

(2) Compression engine (CE): The CE maintains a 
large content cache on a magnetic disk. SHA-1 finger- 
prints of cached content are stored in a large hash table. 
Fingerprints handed over by the CM are looked up in 
the hash table to identify similarity against prior content 
chunks. After redundancy has been identified, the in- 
coming object is compressed and handed over to the net- 
work subsystem (described next). The object’s chunks 
are inserted into the content cache in a serial fashion, 
and SHA-1 hashes for its chunks are inserted into the 
hash table with pointers to the on-disk addresses of cor- 
responding chunks. 

The CE’s hash table can be stored either in a CLAM
or using BDB on flash. The CLAM is configured with
4GB RAM and 32GB of Transcend SSD. The CLAM 
implements the full BufferHash functionality, including 
lazy updates with FIFO eviction as well as windowed bit 
slicing. For BDB-based WAN optimizer, we implement 
FIFO eviction from the hash table by maintaining an in- 
memory delete list of invalidated old hash table entries. 
The BDB hash table is also 32GB in size.

(3) Network sub-system (NS): The NS simply transmits
the bytes handed over by CE over the outgoing network 
link. In commercial WAN optimizers, the NS uses an op- 
timized custom TCP implementation that can send data 
at the highest possible rate (without needing repeated 
slow start, congestion avoidance etc.) 
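The chunking step performed by the CM front-end can be sketched as below. This is an illustration under assumptions: it uses a polynomial rolling hash in the Rabin-Karp style rather than true Rabin fingerprints over GF(2), and the window size, boundary mask, base, and modulus are arbitrary choices that are not taken from the system described here.

```python
import hashlib

def split_chunks(data: bytes, window=16, mask=0x3F, base=257, mod=(1 << 31) - 1):
    """Cut data wherever the rolling hash of the last `window` bytes has
    its low bits equal to zero, so boundaries depend only on content."""
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                 # add the newest byte
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

def fingerprints(data: bytes):
    """SHA-1 fingerprint of each content-defined chunk."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in split_chunks(data)]
```

Because a boundary depends only on the bytes inside the window, prepending a few bytes to an object changes at most its first chunks; the remaining chunks, and hence their SHA-1 fingerprints, still match entries already in the hash table.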

In order to focus on the efficacy of CE, we employ two 
simplifications in our evaluation: (1) We emulate a high- 
speed CM by pre-computing chunks and SHA-1 finger- 
prints for objects. (2) To emulate TCP optimization in 
NS, we simply use UDP to transmit data at close to link 
speed and turn off flow control and congestion control. 

In our experiments, we vary the WAN link speed from 
10Mbps to 0.5Gbps. 

Evaluation: We use real packet traces in our eval- 
uation. These traces were collected at University of 
Wisconsin-Madison’s access link to the Internet and at 
the access link of a high volume Web server in the univer- 
sity. From these packet traces we construct object-level 
traces by grouping packets with the same connection 4- 
tuple into a single object and using an inactivity timeout 
of 10s. We also conducted thorough evaluation using a 
variety of synthetic traces where we varied the redun- 
dancy fraction. We omit the results for brevity and note 
that they are qualitatively similar. 
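The construction of object-level traces can be sketched as follows. The `Packet` record layout and the function name are assumptions, and the sketch tracks only byte counts rather than payloads.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "time src dst sport dport size")

def build_objects(packets, timeout=10.0):
    """Group packets by connection 4-tuple, starting a new object
    whenever a connection has been idle for more than `timeout` seconds."""
    open_objects = {}  # 4-tuple -> object currently being assembled
    finished = []
    for p in sorted(packets, key=lambda pkt: pkt.time):
        conn = (p.src, p.dst, p.sport, p.dport)
        obj = open_objects.get(conn)
        if obj is not None and p.time - obj["last"] > timeout:
            finished.append(obj)  # inactivity timeout: close the object
            obj = None
        if obj is None:
            obj = {"conn": conn, "start": p.time, "last": p.time, "bytes": 0}
            open_objects[conn] = obj
        obj["last"] = p.time
        obj["bytes"] += p.size
    finished.extend(open_objects.values())
    return finished
```

For example, three packets on one connection at times 0s, 5s, and 20s yield two objects, since the gap before the third packet exceeds the 10s timeout.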

Scenarios. We study two scenarios both based on re- 
playing traces against our experimental setup: 

(1) Throughput test: All objects arrive at once. We 
then measure the total time taken to transmit the objects 
with and without using our WAN optimizer. The ratio 
of the latter to the former measures the extent to which 
the WAN optimizer helps improve effective capacity of 
the attached WAN link, and we refer to it as the effective 
bandwidth improvement. 

(2) Acceleration under high load: Here, objects arrive 
at a rate matching the link speed; thus, the link is 100% 
utilized when there is no compression. For each object, 
we measure the time difference from object arrival to 
the last byte of the object being sent, with and without 
WAN optimization. In either case, we also measure the 
throughput the object achieves (= effective size/time dif- 
ference). When WAN optimization is used, the time dif- 
ference includes the time to fingerprint the object, look 
for matches and compress the object. In addition, it may 
include delays due to earlier objects (e.g., updating the 
index with fingerprints for the earlier object). Finally, 
we measure the per object throughput improvement as 
the ratio of an object’s throughput with and without the 
WAN optimizer. 
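The two metrics defined above reduce to simple ratios, sketched here with hypothetical function and parameter names and times given in seconds.

```python
def effective_bandwidth_improvement(time_without_s: float, time_with_s: float) -> float:
    """Scenario 1: total transfer time without the optimizer over time with it."""
    return time_without_s / time_with_s

def object_throughput(size_bytes: float, arrival_s: float, last_byte_sent_s: float) -> float:
    """Effective size divided by the time from arrival to last byte sent."""
    return size_bytes / (last_byte_sent_s - arrival_s)

def per_object_improvement(size_bytes: float, arrival_s: float,
                           done_without_s: float, done_with_s: float) -> float:
    """Scenario 2: an object's throughput with the optimizer over without it."""
    with_opt = object_throughput(size_bytes, arrival_s, done_with_s)
    without_opt = object_throughput(size_bytes, arrival_s, done_without_s)
    return with_opt / without_opt

# A trace that takes 10 s uncompressed and 5 s with the optimizer.
print(effective_bandwidth_improvement(10.0, 5.0))
```

A value below 1 for either metric means the optimizer slowed the transfer down, which is the situation described later for BDB at higher link speeds.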


8.1 Benefits of Using CLAMs 


Scenario 1: Figure 9 shows the effective bandwidth
improvement using CLAM-based and BDB-based WAN
optimizers at different link speeds. Both WAN optimizers
use Transcend-SSD. Figure 9(a) shows the results for
a high (50%) redundancy trace (i.e., optimal improvement
factor 2). The BDB-based WAN optimizer gives
close-to-optimal improvement (2×) at low link speeds of
up to 10Mbps. However, at higher link speeds it becomes
a bottleneck and drastically reduces the effective
bandwidth instead of improving it. In comparison, the
CLAM-based WAN optimizer gives close-to-ideal improvement
at 10× higher (100Mbps) link speeds and gives
reasonable improvements even at 200 Mbps. It becomes
a bottleneck at 400Mbps, making it counterproductive at
such speeds. Using Intel-SSD, the CLAM-based WAN
optimizer can run up to 500 Mbps while offering close
to ideal improvement, but using Intel-SSD with BDB
does not improve the situation significantly. A similar
trend was observed for the low (15%) redundancy trace
(i.e., optimal improvement factor 1.18), whose results are
shown in Figure 9(b). In this case, the CLAM-based WAN
optimizer is able to operate at even higher link speeds
while giving close to ideal improvement. This is because,
when redundancy is low, lookups in the case of CLAMs
seldom go to flash, which results in higher throughput.

Scenario 2: We fix the link speed to be 10Mbps
for this analysis, because BDB is ineffective at higher
speeds. We use a trace with 50% redundancy. We now
take a closer look at the improvements by CLAMs and
BDB. Figures 10(a) and (b) show the relative throughput
improvement on an object-by-object basis (only
improvements up to a factor of 2 are shown). We see that
Berkeley-DB has a negative effect on the throughputs
of a large number of objects (compared to ideal),
especially objects 500KB or smaller; their throughput is
worsened by a factor of two or more due to the high costs
of lookups and inserts (the latter for fingerprints of prior
objects). Our CLAM also imposes overhead on some
of these objects, but this happens on far fewer occasions
and the overhead is significantly lower. Also, the average
per-object improvement is 3.1 for our CLAM, which is
65% better than BDB (average improvement of 1.9).

Figure 9: Effective capacity improvement vs different
link rates for (a) 50% and (b) 15% redundancy traces.
[Plot: effective bandwidth improvement factor vs. link
bandwidth (Mbps), 10 to 400 Mbps, for Bufferhash +
SSD (Transcend) and BerkeleyDB + SSD (Transcend).]

Figure 10: Heavy load scenario: Throughput improvement
per object for (a) BufferHash-based CLAM using
Transcend SSD and (b) Berkeley-DB on Transcend SSD.


9 Conclusions 


We have designed and implemented CLAMs (Cheap and 
Large CAMs) for high-performance content-based net- 
worked systems that require large hash tables (up to 
100GB or more) with support for fast insertion, lookups 
and updates. Our design uses a combination of DRAM 
and flash storage along with a novel data structure, called 
BufferHash, to facilitate fast hash table operations. Our 
CLAM supports a larger index than DRAM-only solu- 
tions, and faster hash operations than disk- or flash-only 
solutions. It can offer a few orders of magnitude more 
hash operations/s/$ than these alternatives. We have in- 
corporated our CLAM prototype in a WAN optimizer 
and showed that it can enhance the benefits significantly. 

Our design is not final, but it is a key step toward sup- 
porting high speed operation of modern data-intensive 
networked systems. It may be possible to design bet- 
ter CLAMs by leveraging space-saving ideas from recent 
systems such as FAWN [13] (to help control the amount
of DRAM needed by BufferHash), using coding techniques
such as floating codes (for better eviction support),
or by using newer memory technologies such as
Phase Change Memory (which can support much better 
read/write latencies than flash). 